[med-svn] [Git][med-team/augur][master] 4 commits: routine-update: New upstream version

Sun Jul 3 16:53:16 BST 2022


Étienne Mollier pushed to branch master at Debian Med / augur


Commits:
fe35c1f9 by Étienne Mollier at 2022-07-03T17:46:32+02:00
routine-update: New upstream version

- - - - -
5297b911 by Étienne Mollier at 2022-07-03T17:46:33+02:00
New upstream version 16.0.2
- - - - -
219e80c3 by Étienne Mollier at 2022-07-03T17:47:23+02:00
Update upstream source from tag 'upstream/16.0.2'

Update to upstream version '16.0.2'
with Debian dir e124fe0183f60c73ae1ab9aae4dc63bc69aa1353
- - - - -
9844f882 by Étienne Mollier at 2022-07-03T17:51:32+02:00
routine-update: Ready to upload to unstable

- - - - -


11 changed files:

- CHANGES.md
- augur/__version__.py
- augur/ancestral.py
- augur/filter.py
- − augur/io_support/__init__.py
- debian/changelog
- tests/builds/zika/results/nt_muts.json
- tests/functional/ancestral.t
- tests/functional/ancestral/tree_raw.nwk
- + tests/functional/filter/cram/filter-force-include-no-duplicates.t
- + tests/functional/filter/cram/subsample-group-by-with-custom-year-column.t


Changes:

=====================================
CHANGES.md
=====================================
@@ -3,6 +3,22 @@
 ## __NEXT__
 
 
+## 16.0.2 (30 June 2022)
+
+### Bug Fixes
+
+* The entropy panel was unavailable if mutations were not translated [#881][]. This has been fixed by creating an additional `annotations` block in `augur ancestral` containing (nucleotide) genome annotations in the node-data [#961][] (@jameshadfield)
+* ancestral: WARNINGs to stdout have been updated to print to stderr [#961][] (@jameshadfield)
+* filter: Explicitly drop date/year/month columns from metadata during grouping. [#967][] (@victorlin)
+    * This fixes a bug [#871][] where `augur filter` would crash with a cryptic `ValueError` if `year` and/or `month` is a custom column in the input metadata and also included in `--group-by`.
+* filter: Fix duplicates that may appear in metadata when using `--include`/`--include-where` with subsampling [#986][] (@victorlin)
+
+[#881]: https://github.com/nextstrain/augur/issues/881
+[#961]: https://github.com/nextstrain/augur/pull/961
+[#967]: https://github.com/nextstrain/augur/pull/967
+[#871]: https://github.com/nextstrain/augur/issues/871
+[#986]: https://github.com/nextstrain/augur/pull/986
+
 ## 16.0.1 (21 June 2022)
 
 ### Bug Fixes


=====================================
augur/__version__.py
=====================================
@@ -1,4 +1,4 @@
-__version__ = '16.0.1'
+__version__ = '16.0.2'
 
 
 def is_augur_version_compatible(version):


=====================================
augur/ancestral.py
=====================================
@@ -124,7 +124,7 @@ def register_arguments(parser):
     parser.add_argument('--output-sequences', type=str, help='name of FASTA file to save ancestral sequences to (FASTA alignments only)')
     parser.add_argument('--inference', default='joint', choices=["joint", "marginal"],
                                     help="calculate joint or marginal maximum likelihood ancestral sequence states")
-    parser.add_argument('--vcf-reference', type=str, help='fasta file of the sequence the VCF was mapped to')
+    parser.add_argument('--vcf-reference', type=str, help='fasta file of the sequence the VCF was mapped to (only used if a VCF is provided as the alignment)')
     parser.add_argument('--output-vcf', type=str, help='name of output VCF file which will include ancestral seqs')
     ambiguous = parser.add_mutually_exclusive_group()
     ambiguous.add_argument('--keep-ambiguous', action="store_true",
@@ -136,7 +136,7 @@ def register_arguments(parser):
 
 def run(args):
     # check alignment type, set flags, read in if VCF
-    is_vcf = False
+    is_vcf = any([args.alignment.lower().endswith(x) for x in ['.vcf', '.vcf.gz']])
     ref = None
     anc_seqs = {}
 
@@ -149,22 +149,21 @@ def run(args):
     import numpy as np
     missing_internal_node_names = [n.name is None for n in T.get_nonterminals()]
     if np.all(missing_internal_node_names):
-        print("\n*** WARNING: Tree has no internal node names!")
-        print("*** Without internal node names, ancestral sequences can't be linked up to the correct node later.")
-        print("*** If you want to use 'augur export' or `augur translate` later, re-run this command with the output of 'augur refine'.")
-        print("*** If you haven't run 'augur refine', you can add node names to your tree by running:")
-        print("*** augur refine --tree %s --output-tree <filename>.nwk"%(args.tree) )
-        print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'")
-
-    if any([args.alignment.lower().endswith(x) for x in ['.vcf', '.vcf.gz']]):
+        print("\n*** WARNING: Tree has no internal node names!", file=sys.stderr)
+        print("*** Without internal node names, ancestral sequences can't be linked up to the correct node later.", file=sys.stderr)
+        print("*** If you want to use 'augur export' or `augur translate` later, re-run this command with the output of 'augur refine'.", file=sys.stderr)
+        print("*** If you haven't run 'augur refine', you can add node names to your tree by running:", file=sys.stderr)
+        print("*** augur refine --tree %s --output-tree <filename>.nwk"%(args.tree) , file=sys.stderr)
+        print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'", file=sys.stderr)
+
+    if is_vcf:
         if not args.vcf_reference:
-            print("ERROR: a reference Fasta is required with VCF-format alignments")
+            print("ERROR: a reference Fasta is required with VCF-format alignments", file=sys.stderr)
             return 1
 
         compress_seq = read_vcf(args.alignment, args.vcf_reference)
         aln = compress_seq['sequences']
         ref = compress_seq['reference']
-        is_vcf = True
     else:
         aln = args.alignment
 
@@ -172,7 +171,7 @@ def run(args):
     from distutils.version import StrictVersion
     import treetime
     if StrictVersion(treetime.version) < StrictVersion('0.7.0'):
-        print("ERROR: this version of augur requires TreeTime 0.7 or later.")
+        print("ERROR: this version of augur requires TreeTime 0.7 or later.", file=sys.stderr)
         return 1
 
     # Infer ambiguous bases if the user has requested that we infer them (either
@@ -208,6 +207,8 @@ def run(args):
     if anc_seqs.get("mask") is not None:
         anc_seqs["mask"] = "".join(['1' if x else '0' for x in anc_seqs["mask"]])
 
+    anc_seqs['annotations'] = {'nuc': {'start': 1, 'end': len(anc_seqs['reference']['nuc']), 'strand': '+'}}
+
     out_name = get_json_name(args, '.'.join(args.alignment.split('.')[:-1]) + '_mutations.json')
     write_json(anc_seqs, out_name)
     print("ancestral mutations written to", out_name, file=sys.stdout)
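
The `annotations` block and the VCF detection introduced in this file can be looked at in isolation. What follows is a minimal sketch, not augur's code: it assumes a node-data dictionary shaped like `anc_seqs`, with the reference stored under `reference` -> `nuc`. The helper names `add_nuc_annotation`, `is_vcf_path` and `warn` are hypothetical and exist only for illustration.

```python
import json
import sys


def add_nuc_annotation(node_data):
    """Attach a 1-based nucleotide annotation spanning the whole reference.

    Assumes the reference sequence is at node_data["reference"]["nuc"].
    """
    reference_length = len(node_data["reference"]["nuc"])
    node_data["annotations"] = {
        "nuc": {"start": 1, "end": reference_length, "strand": "+"}
    }
    return node_data


def is_vcf_path(path):
    """Return True when the file name ends in .vcf or .vcf.gz (case-insensitive)."""
    return path.lower().endswith((".vcf", ".vcf.gz"))


def warn(message):
    """Print a diagnostic to stderr so that stdout stays free of warnings."""
    print(f"WARNING: {message}", file=sys.stderr)


data = add_nuc_annotation({"reference": {"nuc": "ACGTACGT"}})
print(json.dumps(data["annotations"], sort_keys=True))
# {"nuc": {"end": 8, "start": 1, "strand": "+"}}

if not is_vcf_path("aligned.fasta"):
    warn("alignment is not a VCF, so a VCF reference would be unused")
```

Computing the VCF flag once from the file name, as the diff does with `is_vcf`, avoids having a flag that is set in one branch and read in another.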


=====================================
augur/filter.py
=====================================
@@ -931,6 +931,14 @@ def get_groups_for_subsampling(strains, metadata, group_by=None):
 
     # date requested
     if 'year' in group_by_set or 'month' in group_by_set:
+
+        if 'year' in metadata.columns and 'year' in group_by_set:
+            print(f"WARNING: `--group-by year` uses the generated year value from the 'date' column. The custom 'year' column in the metadata is ignored for grouping purposes.", file=sys.stderr)
+            metadata.drop('year', axis=1, inplace=True)
+        if 'month' in metadata.columns and 'month' in group_by_set:
+            print(f"WARNING: `--group-by month` uses the generated month value from the 'date' column. The custom 'month' column in the metadata is ignored for grouping purposes.", file=sys.stderr)
+            metadata.drop('month', axis=1, inplace=True)
+
         if 'date' not in metadata:
             # set year/month/day = unknown
             print(f"WARNING: A 'date' column could not be found to group-by year or month.", file=sys.stderr)
@@ -1149,7 +1157,10 @@ def register_arguments(parser):
     sequence_filter_group.add_argument('--non-nucleotide', action='store_true', help="exclude sequences that contain illegal characters")
 
     subsample_group = parser.add_argument_group("subsampling", "options to subsample filtered data")
-    subsample_group.add_argument('--group-by', nargs='+', help="categories with respect to subsample; two virtual fields, \"month\" and \"year\", are supported if they don't already exist as real fields but a \"date\" field does exist")
+    subsample_group.add_argument('--group-by', nargs='+', help="""
+        categories with respect to subsample.
+        Grouping by 'year' and/or 'month' is only supported when there is a 'date' column in the metadata.
+        Custom 'year' and 'month' columns in the metadata are ignored for grouping. Please rename them if you want to use their values for grouping.""")
     subsample_limits_group = subsample_group.add_mutually_exclusive_group()
     subsample_limits_group.add_argument('--sequences-per-group', type=int, help="subsample to no more than this number of sequences per category")
     subsample_limits_group.add_argument('--subsample-max-sequences', type=int, help="subsample to no more than this number of sequences; can be used without the group_by argument")
@@ -1398,11 +1409,11 @@ def run(args):
 
         # Track distinct strains to include, so we can write their
         # corresponding metadata, strains, or sequences later, as needed..
-        distinct_sequences_to_include = {
+        distinct_force_included_strains = {
             record["strain"]
             for record in sequences_to_include
         }
-        all_sequences_to_include.update(distinct_sequences_to_include)
+        all_sequences_to_include.update(distinct_force_included_strains)
 
         # Track reasons for filtered or force-included strains, so we can
         # report total numbers filtered and included at the end. Optionally,
@@ -1418,6 +1429,10 @@ def run(args):
                 output_log_writer.writerow(filtered_strain)
 
         if group_by:
+            # Prevent force-included sequences from being included again during
+            # subsampling.
+            seq_keep = seq_keep - distinct_force_included_strains
+
             # If grouping, track the highest priority metadata records or
             # count the number of records per group. First, we need to get
             # the groups for the given records.
@@ -1466,13 +1481,13 @@ def run(args):
         # Always write out strains that are force-included. Additionally, if
         # we are not grouping, write out metadata and strains that passed
         # filters so far.
-        strains_to_write = distinct_sequences_to_include
+        force_included_strains_to_write = distinct_force_included_strains
         if not group_by:
-            strains_to_write = strains_to_write | seq_keep
+            force_included_strains_to_write = force_included_strains_to_write | seq_keep
 
         if args.output_metadata:
             # TODO: wrap logic to write metadata into its own function
-            metadata.loc[list(strains_to_write)].to_csv(
+            metadata.loc[list(force_included_strains_to_write)].to_csv(
                 args.output_metadata,
                 sep="\t",
                 header=metadata_header,
@@ -1484,7 +1499,7 @@ def run(args):
         if args.output_strains:
             # TODO: Output strains will no longer be ordered. This is a
             # small breaking change.
-            for strain in strains_to_write:
+            for strain in force_included_strains_to_write:
                 output_strains.write(f"{strain}\n")
 
     # In the worst case, we need to calculate sequences per group from the
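
The duplicate fix in this file comes down to a set difference taken before subsampling. The sketch below uses plain Python sets and hypothetical strain names. `subsample` is a stand-in written for this example and does not reproduce augur's group- and priority-based subsampling.

```python
import random


def subsample(strains, max_count, seed=0):
    """Hypothetical stand-in for subsampling: pick up to max_count strains."""
    rng = random.Random(seed)
    pool = sorted(strains)  # sort first so that the seeded draw is reproducible
    return set(rng.sample(pool, min(max_count, len(pool))))


def select_strains(passed_filters, force_included, max_count):
    """Return the strains to write: force-included ones, then subsampled ones."""
    # Remove force-included strains from the subsampling pool so that they
    # cannot be selected, and therefore written, a second time.
    candidates = set(passed_filters) - set(force_included)
    sampled = subsample(candidates, max_count)
    return sorted(force_included) + sorted(sampled)


rows = select_strains({"a", "b", "c", "d"}, {"a"}, max_count=4)
print(rows)  # ['a', 'b', 'c', 'd']
```

Without the set difference, strain `a` would remain in the candidate pool and could be emitted once as a force-included record and again as a subsampled one.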


=====================================
augur/io_support/__init__.py deleted
=====================================


=====================================
debian/changelog
=====================================
@@ -1,3 +1,9 @@
+augur (16.0.2-1) unstable; urgency=medium
+
+  * New upstream version
+
+ -- Étienne Mollier <emollier at debian.org>  Sun, 03 Jul 2022 17:47:31 +0200
+
 augur (16.0.1-1) unstable; urgency=medium
 
   * d/control: add myself to Uploaders.


=====================================
tests/builds/zika/results/nt_muts.json
=====================================
@@ -1,4 +1,11 @@
 {
+  "annotations": {
+    "nuc": {
+      "end": 10769,
+      "start": 1,
+      "strand": "+"
+    }
+  },
   "generated_by": {
     "program": "augur",
     "version": "7.0.2"


=====================================
tests/functional/ancestral.t
=====================================
@@ -15,6 +15,16 @@ The default is to infer ambiguous bases, so there should not be N bases in the i
   $ grep N "$TMP/ancestral_sequences.fasta"
   >NODE_0000000
 
+Check that the reference length was correctly exported as the nuc annotation
+  $ grep -A 6 'annotations' "$TMP/ancestral_mutations.json"
+    "annotations": {
+      "nuc": {
+        "end": 10769,
+        "start": 1,
+        "strand": "+"
+      }
+    },
+
 Infer ancestral sequences for the given tree and alignment, explicitly requesting that ambiguous bases are inferred.
 There should not be N bases in the inferred output sequences.
 


=====================================
tests/functional/ancestral/tree_raw.nwk
=====================================
@@ -1 +1 @@
-(KX369547.1:0.00018895,1_0087_PF:0.00047255,1_0181_PF:0.00018905):0.00000000;
+(KX369547.1:0.00018895,1_0087_PF:0.00047255,1_0181_PF:0.00018905)NODE_0000000:0.00000000;


=====================================
tests/functional/filter/cram/filter-force-include-no-duplicates.t
=====================================
@@ -0,0 +1,95 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+
+Test that a force-included strain is only output once.
+
+
+Create some files for testing.
+
+  $ cat >$TMP/metadata.tsv <<~~
+  > strain	col
+  > a	1
+  > b	2
+  > c	3
+  > d	4
+  > ~~
+  $ cat >$TMP/sequences.fasta <<~~
+  > >a
+  > NNNN
+  > >b
+  > NNNN
+  > >c
+  > NNNN
+  > >d
+  > NNNN
+  > ~~
+
+Test all outputs with --include-where.
+
+  $ ${AUGUR} filter \
+  >   --metadata $TMP/metadata.tsv \
+  >   --sequences $TMP/sequences.fasta \
+  >   --subsample-max-sequences 4 \
+  >   --include-where col=1 \
+  >   --subsample-seed 0 \
+  >   --output-metadata $TMP/metadata-filtered.tsv \
+  >   --output-strains $TMP/strains-filtered.txt \
+  >   --output-sequences $TMP/sequences-filtered.fasta \
+  >   > /dev/null 2>&1
+  $ cat $TMP/metadata-filtered.tsv | tail -n+2 | sort -k1
+  a\t1 (esc)
+  b\t2 (esc)
+  c\t3 (esc)
+  d\t4 (esc)
+  $ cat $TMP/strains-filtered.txt | sort
+  a
+  b
+  c
+  d
+  $ cat $TMP/sequences-filtered.fasta
+  >a
+  NNNN
+  >b
+  NNNN
+  >c
+  NNNN
+  >d
+  NNNN
+
+Test all outputs with --include.
+
+  $ cat >$TMP/include.txt <<~~
+  > a
+  > ~~
+  $ ${AUGUR} filter \
+  >   --metadata $TMP/metadata.tsv \
+  >   --sequences $TMP/sequences.fasta \
+  >   --subsample-max-sequences 4 \
+  >   --include $TMP/include.txt \
+  >   --subsample-seed 0 \
+  >   --output-metadata $TMP/metadata-filtered.tsv \
+  >   --output-strains $TMP/strains-filtered.txt \
+  >   --output-sequences $TMP/sequences-filtered.fasta \
+  >   > /dev/null 2>&1
+  $ cat $TMP/metadata-filtered.tsv | tail -n+2 | sort -k1
+  a\t1 (esc)
+  b\t2 (esc)
+  c\t3 (esc)
+  d\t4 (esc)
+  $ cat $TMP/strains-filtered.txt | sort
+  a
+  b
+  c
+  d
+  $ cat $TMP/sequences-filtered.fasta
+  >a
+  NNNN
+  >b
+  NNNN
+  >c
+  NNNN
+  >d
+  NNNN


=====================================
tests/functional/filter/cram/subsample-group-by-with-custom-year-column.t
=====================================
@@ -0,0 +1,45 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Create a metadata file with a custom year column
+
+  $ cat >$TMP/metadata-year-column.tsv <<~~
+  > strain	date	year	month
+  > SEQ1	2021-01-01	odd	January
+  > SEQ2	2021-01-02	odd	January
+  > SEQ3	2022-01-01	even	January
+  > SEQ4	2022-01-02	even	January
+  > SEQ5	2022-02-02	even	February
+  > ~~
+
+Group by generated year column, and ensure all original columns are still in the final output.
+
+  $ ${AUGUR} filter \
+  >  --metadata $TMP/metadata-year-column.tsv \
+  >  --group-by year \
+  >  --sequences-per-group 1 \
+  >  --subsample-seed 0 \
+  >  --output-metadata "$TMP/filtered_metadata.tsv" > /dev/null
+  WARNING: `--group-by year` uses the generated year value from the 'date' column. The custom 'year' column in the metadata is ignored for grouping purposes.
+  $ cat "$TMP/filtered_metadata.tsv"
+  strain\tdate\tyear\tmonth (esc)
+  SEQ1\t2021-01-01\todd\tJanuary (esc)
+  SEQ5\t2022-02-02\teven\tFebruary (esc)
+
+Group by generated year and month columns, and ensure all original columns are still in the final output.
+
+  $ ${AUGUR} filter \
+  >  --metadata $TMP/metadata-year-column.tsv \
+  >  --group-by year month \
+  >  --sequences-per-group 1 \
+  >  --subsample-seed 0 \
+  >  --output-metadata "$TMP/filtered_metadata.tsv" > /dev/null
+  WARNING: `--group-by year` uses the generated year value from the 'date' column. The custom 'year' column in the metadata is ignored for grouping purposes.
+  WARNING: `--group-by month` uses the generated month value from the 'date' column. The custom 'month' column in the metadata is ignored for grouping purposes.
+  $ cat "$TMP/filtered_metadata.tsv"
+  strain\tdate\tyear\tmonth (esc)
+  SEQ1\t2021-01-01\todd\tJanuary (esc)
+  SEQ3\t2022-01-01\teven\tJanuary (esc)
+  SEQ5\t2022-02-02\teven\tFebruary (esc)
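
The behaviour exercised by this test can be sketched in plain Python. This is an illustration only: it assumes ISO `YYYY-MM-DD` dates, whereas augur itself works on a pandas DataFrame. `group_by_generated_year` is a hypothetical helper, not an augur function.

```python
import sys
from collections import defaultdict


def group_by_generated_year(records):
    """Group strains by the year parsed from 'date'.

    Any custom 'year' column is ignored for grouping but left in the records,
    so it can still be written to the output unchanged.
    """
    if records and "year" in records[0]:
        print("WARNING: custom 'year' column is ignored for grouping.",
              file=sys.stderr)
    groups = defaultdict(list)
    for record in records:
        generated_year = int(record["date"].split("-")[0])
        groups[generated_year].append(record["strain"])
    return dict(groups)


records = [
    {"strain": "SEQ1", "date": "2021-01-01", "year": "odd"},
    {"strain": "SEQ2", "date": "2021-01-02", "year": "odd"},
    {"strain": "SEQ3", "date": "2022-01-01", "year": "even"},
]
groups = group_by_generated_year(records)
print(groups)  # {2021: ['SEQ1', 'SEQ2'], 2022: ['SEQ3']}
```

Keeping the generated value separate from the custom column is what lets the test above still see `odd` and `even` in the filtered metadata.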



View it on GitLab: https://salsa.debian.org/med-team/augur/-/compare/0b6a744aefe0e50d30d6f2b5519affc06f0497c1...9844f882ad40b8f3a85023724e616f0d7ea29275


