[med-svn] [Git][med-team/augur][master] 4 commits: routine-update: New upstream version
Étienne Mollier (@emollier)
gitlab at salsa.debian.org
Sun Jul 3 16:53:16 BST 2022
Étienne Mollier pushed to branch master at Debian Med / augur
Commits:
fe35c1f9 by Étienne Mollier at 2022-07-03T17:46:32+02:00
routine-update: New upstream version
- - - - -
5297b911 by Étienne Mollier at 2022-07-03T17:46:33+02:00
New upstream version 16.0.2
- - - - -
219e80c3 by Étienne Mollier at 2022-07-03T17:47:23+02:00
Update upstream source from tag 'upstream/16.0.2'
Update to upstream version '16.0.2'
with Debian dir e124fe0183f60c73ae1ab9aae4dc63bc69aa1353
- - - - -
9844f882 by Étienne Mollier at 2022-07-03T17:51:32+02:00
routine-update: Ready to upload to unstable
- - - - -
11 changed files:
- CHANGES.md
- augur/__version__.py
- augur/ancestral.py
- augur/filter.py
- − augur/io_support/__init__.py
- debian/changelog
- tests/builds/zika/results/nt_muts.json
- tests/functional/ancestral.t
- tests/functional/ancestral/tree_raw.nwk
- + tests/functional/filter/cram/filter-force-include-no-duplicates.t
- + tests/functional/filter/cram/subsample-group-by-with-custom-year-column.t
Changes:
=====================================
CHANGES.md
=====================================
@@ -3,6 +3,22 @@
## __NEXT__
+## 16.0.2 (30 June 2022)
+
+### Bug Fixes
+
+* The entropy panel was unavailable if mutations were not translated [#881][]. This has been fixed by creating an additional `annotations` block in `augur ancestral` containing (nucleotide) genome annotations in the node-data [#961][] (@jameshadfield)
+* ancestral: WARNINGs to stdout have been updated to print to stderr [#961][] (@jameshadfield)
+* filter: Explicitly drop date/year/month columns from metadata during grouping. [#967][] (@victorlin)
+ * This fixes a bug [#871][] where `augur filter` would crash with a cryptic `ValueError` if `year` and/or `month` is a custom column in the input metadata and also included in `--group-by`.
+* filter: Fix duplicates that may appear in metadata when using `--include`/`--include-where` with subsampling [#986][] (@victorlin)
+
+[#881]: https://github.com/nextstrain/augur/issues/881
+[#961]: https://github.com/nextstrain/augur/pull/961
+[#967]: https://github.com/nextstrain/augur/pull/967
+[#871]: https://github.com/nextstrain/augur/issues/871
+[#986]: https://github.com/nextstrain/augur/pull/986
+
## 16.0.1 (21 June 2022)
### Bug Fixes
=====================================
augur/__version__.py
=====================================
@@ -1,4 +1,4 @@
-__version__ = '16.0.1'
+__version__ = '16.0.2'
def is_augur_version_compatible(version):
=====================================
augur/ancestral.py
=====================================
@@ -124,7 +124,7 @@ def register_arguments(parser):
parser.add_argument('--output-sequences', type=str, help='name of FASTA file to save ancestral sequences to (FASTA alignments only)')
parser.add_argument('--inference', default='joint', choices=["joint", "marginal"],
help="calculate joint or marginal maximum likelihood ancestral sequence states")
- parser.add_argument('--vcf-reference', type=str, help='fasta file of the sequence the VCF was mapped to')
+ parser.add_argument('--vcf-reference', type=str, help='fasta file of the sequence the VCF was mapped to (only used if a VCF is provided as the alignment)')
parser.add_argument('--output-vcf', type=str, help='name of output VCF file which will include ancestral seqs')
ambiguous = parser.add_mutually_exclusive_group()
ambiguous.add_argument('--keep-ambiguous', action="store_true",
@@ -136,7 +136,7 @@ def register_arguments(parser):
def run(args):
# check alignment type, set flags, read in if VCF
- is_vcf = False
+ is_vcf = any([args.alignment.lower().endswith(x) for x in ['.vcf', '.vcf.gz']])
ref = None
anc_seqs = {}
@@ -149,22 +149,21 @@ def run(args):
import numpy as np
missing_internal_node_names = [n.name is None for n in T.get_nonterminals()]
if np.all(missing_internal_node_names):
- print("\n*** WARNING: Tree has no internal node names!")
- print("*** Without internal node names, ancestral sequences can't be linked up to the correct node later.")
- print("*** If you want to use 'augur export' or `augur translate` later, re-run this command with the output of 'augur refine'.")
- print("*** If you haven't run 'augur refine', you can add node names to your tree by running:")
- print("*** augur refine --tree %s --output-tree <filename>.nwk"%(args.tree) )
- print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'")
-
- if any([args.alignment.lower().endswith(x) for x in ['.vcf', '.vcf.gz']]):
+ print("\n*** WARNING: Tree has no internal node names!", file=sys.stderr)
+ print("*** Without internal node names, ancestral sequences can't be linked up to the correct node later.", file=sys.stderr)
+ print("*** If you want to use 'augur export' or `augur translate` later, re-run this command with the output of 'augur refine'.", file=sys.stderr)
+ print("*** If you haven't run 'augur refine', you can add node names to your tree by running:", file=sys.stderr)
+ print("*** augur refine --tree %s --output-tree <filename>.nwk"%(args.tree) , file=sys.stderr)
+ print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'", file=sys.stderr)
+
+ if is_vcf:
if not args.vcf_reference:
- print("ERROR: a reference Fasta is required with VCF-format alignments")
+ print("ERROR: a reference Fasta is required with VCF-format alignments", file=sys.stderr)
return 1
compress_seq = read_vcf(args.alignment, args.vcf_reference)
aln = compress_seq['sequences']
ref = compress_seq['reference']
- is_vcf = True
else:
aln = args.alignment
@@ -172,7 +171,7 @@ def run(args):
from distutils.version import StrictVersion
import treetime
if StrictVersion(treetime.version) < StrictVersion('0.7.0'):
- print("ERROR: this version of augur requires TreeTime 0.7 or later.")
+ print("ERROR: this version of augur requires TreeTime 0.7 or later.", file=sys.stderr)
return 1
# Infer ambiguous bases if the user has requested that we infer them (either
@@ -208,6 +207,8 @@ def run(args):
if anc_seqs.get("mask") is not None:
anc_seqs["mask"] = "".join(['1' if x else '0' for x in anc_seqs["mask"]])
+ anc_seqs['annotations'] = {'nuc': {'start': 1, 'end': len(anc_seqs['reference']['nuc']), 'strand': '+'}}
+
out_name = get_json_name(args, '.'.join(args.alignment.split('.')[:-1]) + '_mutations.json')
write_json(anc_seqs, out_name)
print("ancestral mutations written to", out_name, file=sys.stdout)
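
The `annotations` block added in the hunk above can be sketched in isolation. This is a minimal illustration rather than augur's own code: `nuc_annotation` is a hypothetical helper name, and the toy `anc_seqs` dictionary only assumes the `reference` / `nuc` layout that the added line reads from.

```python
def nuc_annotation(reference_nuc):
    """Build a nucleotide annotation spanning the whole reference.

    Hypothetical helper: it starts at 1 and ends at the reference length,
    mirroring the 'start' / 'end' / 'strand' keys in the diff above.
    """
    return {"nuc": {"start": 1, "end": len(reference_nuc), "strand": "+"}}


# Toy node-data dictionary standing in for anc_seqs (assumed shape).
anc_seqs = {"reference": {"nuc": "ACGTACGTAC"}}
anc_seqs["annotations"] = nuc_annotation(anc_seqs["reference"]["nuc"])
print(anc_seqs["annotations"])
```

With the zika test data in this commit, the same computation gives an `end` of 10769, which is the value the updated `nt_muts.json` and `ancestral.t` expect.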
=====================================
augur/filter.py
=====================================
@@ -931,6 +931,14 @@ def get_groups_for_subsampling(strains, metadata, group_by=None):
# date requested
if 'year' in group_by_set or 'month' in group_by_set:
+
+ if 'year' in metadata.columns and 'year' in group_by_set:
+ print(f"WARNING: `--group-by year` uses the generated year value from the 'date' column. The custom 'year' column in the metadata is ignored for grouping purposes.", file=sys.stderr)
+ metadata.drop('year', axis=1, inplace=True)
+ if 'month' in metadata.columns and 'month' in group_by_set:
+ print(f"WARNING: `--group-by month` uses the generated month value from the 'date' column. The custom 'month' column in the metadata is ignored for grouping purposes.", file=sys.stderr)
+ metadata.drop('month', axis=1, inplace=True)
+
if 'date' not in metadata:
# set year/month/day = unknown
print(f"WARNING: A 'date' column could not be found to group-by year or month.", file=sys.stderr)
@@ -1149,7 +1157,10 @@ def register_arguments(parser):
sequence_filter_group.add_argument('--non-nucleotide', action='store_true', help="exclude sequences that contain illegal characters")
subsample_group = parser.add_argument_group("subsampling", "options to subsample filtered data")
- subsample_group.add_argument('--group-by', nargs='+', help="categories with respect to subsample; two virtual fields, \"month\" and \"year\", are supported if they don't already exist as real fields but a \"date\" field does exist")
+ subsample_group.add_argument('--group-by', nargs='+', help="""
+ categories with respect to subsample.
+ Grouping by 'year' and/or 'month' is only supported when there is a 'date' column in the metadata.
+ Custom 'year' and 'month' columns in the metadata are ignored for grouping. Please rename them if you want to use their values for grouping.""")
subsample_limits_group = subsample_group.add_mutually_exclusive_group()
subsample_limits_group.add_argument('--sequences-per-group', type=int, help="subsample to no more than this number of sequences per category")
subsample_limits_group.add_argument('--subsample-max-sequences', type=int, help="subsample to no more than this number of sequences; can be used without the group_by argument")
@@ -1398,11 +1409,11 @@ def run(args):
# Track distinct strains to include, so we can write their
# corresponding metadata, strains, or sequences later, as needed..
- distinct_sequences_to_include = {
+ distinct_force_included_strains = {
record["strain"]
for record in sequences_to_include
}
- all_sequences_to_include.update(distinct_sequences_to_include)
+ all_sequences_to_include.update(distinct_force_included_strains)
# Track reasons for filtered or force-included strains, so we can
# report total numbers filtered and included at the end. Optionally,
@@ -1418,6 +1429,10 @@ def run(args):
output_log_writer.writerow(filtered_strain)
if group_by:
+ # Prevent force-included sequences from being included again during
+ # subsampling.
+ seq_keep = seq_keep - distinct_force_included_strains
+
# If grouping, track the highest priority metadata records or
# count the number of records per group. First, we need to get
# the groups for the given records.
@@ -1466,13 +1481,13 @@ def run(args):
# Always write out strains that are force-included. Additionally, if
# we are not grouping, write out metadata and strains that passed
# filters so far.
- strains_to_write = distinct_sequences_to_include
+ force_included_strains_to_write = distinct_force_included_strains
if not group_by:
- strains_to_write = strains_to_write | seq_keep
+ force_included_strains_to_write = force_included_strains_to_write | seq_keep
if args.output_metadata:
# TODO: wrap logic to write metadata into its own function
- metadata.loc[list(strains_to_write)].to_csv(
+ metadata.loc[list(force_included_strains_to_write)].to_csv(
args.output_metadata,
sep="\t",
header=metadata_header,
@@ -1484,7 +1499,7 @@ def run(args):
if args.output_strains:
# TODO: Output strains will no longer be ordered. This is a
# small breaking change.
- for strain in strains_to_write:
+ for strain in force_included_strains_to_write:
output_strains.write(f"{strain}\n")
# In the worst case, we need to calculate sequences per group from the
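
The duplicate fix in `run()` relies on a set difference: force-included strains are removed from the pool handed to subsampling, so they cannot be selected a second time. The sketch below shows the idea with plain Python sets. The function name `eligible_for_subsampling` and the stand-in "subsampler" that keeps every eligible strain are assumptions made for illustration, not part of augur.

```python
def eligible_for_subsampling(seq_keep, force_included):
    """Return strains that passed filters and are not already force-included.

    Hypothetical helper mirroring `seq_keep - distinct_force_included_strains`.
    """
    return set(seq_keep) - set(force_included)


# Strains that passed all filters, and the one pulled in by --include.
seq_keep = {"a", "b", "c", "d"}
force_included = {"a"}

eligible = eligible_for_subsampling(seq_keep, force_included)

# Stand-in for subsampling: keep every eligible strain (assumption).
subsampled = sorted(eligible)

# Force-included strains are written first, then the subsampled ones.
output = sorted(force_included) + subsampled
print(output)
```

Because `a` is no longer in the subsampling pool, each strain appears once in `output`, which is the behaviour `filter-force-include-no-duplicates.t` checks.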
=====================================
augur/io_support/__init__.py deleted
=====================================
=====================================
debian/changelog
=====================================
@@ -1,3 +1,9 @@
+augur (16.0.2-1) unstable; urgency=medium
+
+ * New upstream version
+
+ -- Étienne Mollier <emollier at debian.org> Sun, 03 Jul 2022 17:47:31 +0200
+
augur (16.0.1-1) unstable; urgency=medium
* d/control: add myself to Uploaders.
=====================================
tests/builds/zika/results/nt_muts.json
=====================================
@@ -1,4 +1,11 @@
{
+ "annotations": {
+ "nuc": {
+ "end": 10769,
+ "start": 1,
+ "strand": "+"
+ }
+ },
"generated_by": {
"program": "augur",
"version": "7.0.2"
=====================================
tests/functional/ancestral.t
=====================================
@@ -15,6 +15,16 @@ The default is to infer ambiguous bases, so there should not be N bases in the i
$ grep N "$TMP/ancestral_sequences.fasta"
>NODE_0000000
+Check that the reference length was correctly exported as the nuc annotation
+ $ grep -A 6 'annotations' "$TMP/ancestral_mutations.json"
+ "annotations": {
+ "nuc": {
+ "end": 10769,
+ "start": 1,
+ "strand": "+"
+ }
+ },
+
Infer ancestral sequences for the given tree and alignment, explicitly requesting that ambiguous bases are inferred.
There should not be N bases in the inferred output sequences.
=====================================
tests/functional/ancestral/tree_raw.nwk
=====================================
@@ -1 +1 @@
-(KX369547.1:0.00018895,1_0087_PF:0.00047255,1_0181_PF:0.00018905):0.00000000;
+(KX369547.1:0.00018895,1_0087_PF:0.00047255,1_0181_PF:0.00018905)NODE_0000000:0.00000000;
=====================================
tests/functional/filter/cram/filter-force-include-no-duplicates.t
=====================================
@@ -0,0 +1,95 @@
+Setup
+
+ $ pushd "$TESTDIR" > /dev/null
+ $ source _setup.sh
+
+
+Test that a force-included strain is only output once.
+
+
+Create some files for testing.
+
+ $ cat >$TMP/metadata.tsv <<~~
+ > strain col
+ > a 1
+ > b 2
+ > c 3
+ > d 4
+ > ~~
+ $ cat >$TMP/sequences.fasta <<~~
+ > >a
+ > NNNN
+ > >b
+ > NNNN
+ > >c
+ > NNNN
+ > >d
+ > NNNN
+ > ~~
+
+Test all outputs with --include-where.
+
+ $ ${AUGUR} filter \
+ > --metadata $TMP/metadata.tsv \
+ > --sequences $TMP/sequences.fasta \
+ > --subsample-max-sequences 4 \
+ > --include-where col=1 \
+ > --subsample-seed 0 \
+ > --output-metadata $TMP/metadata-filtered.tsv \
+ > --output-strains $TMP/strains-filtered.txt \
+ > --output-sequences $TMP/sequences-filtered.fasta \
+ > > /dev/null 2>&1
+ $ cat $TMP/metadata-filtered.tsv | tail -n+2 | sort -k1
+ a\t1 (esc)
+ b\t2 (esc)
+ c\t3 (esc)
+ d\t4 (esc)
+ $ cat $TMP/strains-filtered.txt | sort
+ a
+ b
+ c
+ d
+ $ cat $TMP/sequences-filtered.fasta
+ >a
+ NNNN
+ >b
+ NNNN
+ >c
+ NNNN
+ >d
+ NNNN
+
+Test all outputs with --include.
+
+ $ cat >$TMP/include.txt <<~~
+ > a
+ > ~~
+ $ ${AUGUR} filter \
+ > --metadata $TMP/metadata.tsv \
+ > --sequences $TMP/sequences.fasta \
+ > --subsample-max-sequences 4 \
+ > --include $TMP/include.txt \
+ > --subsample-seed 0 \
+ > --output-metadata $TMP/metadata-filtered.tsv \
+ > --output-strains $TMP/strains-filtered.txt \
+ > --output-sequences $TMP/sequences-filtered.fasta \
+ > > /dev/null 2>&1
+ $ cat $TMP/metadata-filtered.tsv | tail -n+2 | sort -k1
+ a\t1 (esc)
+ b\t2 (esc)
+ c\t3 (esc)
+ d\t4 (esc)
+ $ cat $TMP/strains-filtered.txt | sort
+ a
+ b
+ c
+ d
+ $ cat $TMP/sequences-filtered.fasta
+ >a
+ NNNN
+ >b
+ NNNN
+ >c
+ NNNN
+ >d
+ NNNN
=====================================
tests/functional/filter/cram/subsample-group-by-with-custom-year-column.t
=====================================
@@ -0,0 +1,45 @@
+Setup
+
+ $ pushd "$TESTDIR" > /dev/null
+ $ source _setup.sh
+
+Create a metadata file with a custom year column
+
+ $ cat >$TMP/metadata-year-column.tsv <<~~
+ > strain date year month
+ > SEQ1 2021-01-01 odd January
+ > SEQ2 2021-01-02 odd January
+ > SEQ3 2022-01-01 even January
+ > SEQ4 2022-01-02 even January
+ > SEQ5 2022-02-02 even February
+ > ~~
+
+Group by generated year column, and ensure all original columns are still in the final output.
+
+ $ ${AUGUR} filter \
+ > --metadata $TMP/metadata-year-column.tsv \
+ > --group-by year \
+ > --sequences-per-group 1 \
+ > --subsample-seed 0 \
+ > --output-metadata "$TMP/filtered_metadata.tsv" > /dev/null
+ WARNING: `--group-by year` uses the generated year value from the 'date' column. The custom 'year' column in the metadata is ignored for grouping purposes.
+ $ cat "$TMP/filtered_metadata.tsv"
+ strain\tdate\tyear\tmonth (esc)
+ SEQ1\t2021-01-01\todd\tJanuary (esc)
+ SEQ5\t2022-02-02\teven\tFebruary (esc)
+
+Group by generated year and month columns, and ensure all original columns are still in the final output.
+
+ $ ${AUGUR} filter \
+ > --metadata $TMP/metadata-year-column.tsv \
+ > --group-by year month \
+ > --sequences-per-group 1 \
+ > --subsample-seed 0 \
+ > --output-metadata "$TMP/filtered_metadata.tsv" > /dev/null
+ WARNING: `--group-by year` uses the generated year value from the 'date' column. The custom 'year' column in the metadata is ignored for grouping purposes.
+ WARNING: `--group-by month` uses the generated month value from the 'date' column. The custom 'month' column in the metadata is ignored for grouping purposes.
+ $ cat "$TMP/filtered_metadata.tsv"
+ strain\tdate\tyear\tmonth (esc)
+ SEQ1\t2021-01-01\todd\tJanuary (esc)
+ SEQ3\t2022-01-01\teven\tJanuary (esc)
+ SEQ5\t2022-02-02\teven\tFebruary (esc)
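
The behaviour exercised by this test, grouping on a year derived from `date` while leaving the custom `year` column untouched in the output, can be sketched without pandas. Everything here is an illustrative assumption: the function name `group_by_generated_year`, the list-of-dicts representation, the shortened warning text, and the requirement that every date is a complete `YYYY-MM-DD` string.

```python
import sys
from collections import defaultdict


def group_by_generated_year(records):
    """Group strain names by the year parsed from each record's 'date'.

    Hypothetical helper: any custom 'year' value is ignored for grouping
    but is left in the records so it can still be written out.
    """
    if any("year" in record for record in records):
        print("WARNING: custom 'year' column is ignored for grouping.",
              file=sys.stderr)

    groups = defaultdict(list)
    for record in records:
        # Assumes complete YYYY-MM-DD dates; ambiguous dates are not handled.
        generated_year = int(record["date"].split("-")[0])
        groups[generated_year].append(record["strain"])
    return dict(groups)


# Same rows as metadata-year-column.tsv in the test above.
records = [
    {"strain": "SEQ1", "date": "2021-01-01", "year": "odd", "month": "January"},
    {"strain": "SEQ2", "date": "2021-01-02", "year": "odd", "month": "January"},
    {"strain": "SEQ3", "date": "2022-01-01", "year": "even", "month": "January"},
    {"strain": "SEQ4", "date": "2022-01-02", "year": "even", "month": "January"},
    {"strain": "SEQ5", "date": "2022-02-02", "year": "even", "month": "February"},
]

groups = group_by_generated_year(records)
print(groups)
```

The records keep their original `odd` / `even` values, so a writer that emits them afterwards still produces every input column, as the test expects.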
View it on GitLab: https://salsa.debian.org/med-team/augur/-/compare/0b6a744aefe0e50d30d6f2b5519affc06f0497c1...9844f882ad40b8f3a85023724e616f0d7ea29275