[med-svn] [Git][med-team/augur][upstream] New upstream version 13.0.1
Nilesh Patra (@nilesh)
gitlab at salsa.debian.org
Wed Oct 6 16:29:11 BST 2021
Nilesh Patra pushed to branch upstream at Debian Med / augur
Commits:
3288cbfb by Nilesh Patra at 2021-10-06T20:52:36+05:30
New upstream version 13.0.1
- - - - -
6 changed files:
- CHANGES.md
- augur/__version__.py
- augur/filter.py
- docs/faq/metadata.md
- tests/builds/various_export_settings/config/footer-description.md
- tests/functional/filter.t
Changes:
=====================================
CHANGES.md
=====================================
@@ -3,6 +3,16 @@
## __NEXT__
+## 13.0.1 (1 October 2021)
+
+### Bug Fixes
+
+* docs: Fix broken link to latitude/longitude documentation. [#766][] (@victorlin)
+* filter: Fix reproducibility of subsampling by using the user-defined random seed in all random function calls and by ordering strain sets as lists prior to adding strains to group-by priority queues. [#772][] (@huddlej)
+
+[#766]: https://github.com/nextstrain/augur/pull/766
+[#772]: https://github.com/nextstrain/augur/pull/772
+
## 13.0.0 (17 August 2021)
### Major Changes
=====================================
augur/__version__.py
=====================================
@@ -1,4 +1,4 @@
-__version__ = '13.0.0'
+__version__ = '13.0.1'
def is_augur_version_compatible(version):
=====================================
augur/filter.py
=====================================
@@ -1049,7 +1049,7 @@ class PriorityQueue:
yield item
-def create_queues_by_group(groups, max_size, max_attempts=100):
+def create_queues_by_group(groups, max_size, max_attempts=100, random_seed=None):
"""Create a dictionary of priority queues per group for the given maximum size.
When the maximum size is fractional, probabilistically sample the maximum
@@ -1067,24 +1067,32 @@ def create_queues_by_group(groups, max_size, max_attempts=100):
Create queues for two groups with a fractional maximum size. Their total max
size should still be an integer value greater than zero.
- >>> queues = create_queues_by_group(groups, 0.1)
+ >>> seed = 314159
+ >>> queues = create_queues_by_group(groups, 0.1, random_seed=seed)
>>> int(sum(queue.max_size for queue in queues.values())) > 0
True
+ A subsequent run of this function with the same groups and random seed
+ should produce the same queues and queue sizes.
+
+ >>> more_queues = create_queues_by_group(groups, 0.1, random_seed=seed)
+ >>> [queue.max_size for queue in queues.values()] == [queue.max_size for queue in more_queues.values()]
+ True
+
"""
queues_by_group = {}
total_max_size = 0
attempts = 0
if max_size < 1.0:
- random_generator = np.random.default_rng()
+ random_generator = np.random.default_rng(random_seed)
# For small fractional maximum sizes, it is possible to randomly select
# maximum queue sizes that all equal zero. When this happens, filtering
# fails unexpectedly. We make multiple attempts to create queues with
# maximum sizes greater than zero for at least one queue.
while total_max_size == 0 and attempts < max_attempts:
- for group in groups:
+ for group in sorted(groups):
if max_size < 1.0:
queue_max_size = random_generator.poisson(max_size)
else:
@@ -1428,10 +1436,11 @@ def run(args):
if queues_by_group is None:
queues_by_group = {}
- for strain, group in group_by_strain.items():
+ for strain in sorted(group_by_strain.keys()):
# During this first pass, we do not know all possible
# groups will be, so we need to build each group's queue
# as we first encounter the group.
+ group = group_by_strain[strain]
if group not in queues_by_group:
queues_by_group[group] = PriorityQueue(
max_size=sequences_per_group,
@@ -1501,6 +1510,7 @@ def run(args):
queues_by_group = create_queues_by_group(
records_per_group.keys(),
sequences_per_group,
+ random_seed=args.subsample_seed,
)
# Make a second pass through the metadata, only considering records that
@@ -1522,7 +1532,8 @@ def run(args):
group_by,
)
- for strain, group in group_by_strain.items():
+ for strain in sorted(group_by_strain.keys()):
+ group = group_by_strain[strain]
queues_by_group[group].add(
metadata.loc[strain],
priorities[strain],
=====================================
docs/faq/metadata.md
=====================================
@@ -37,7 +37,7 @@ Geographic locations can be broken down, for example, into `region`, `country`,
It is important that these are spelled consistently.
-If you want to include locations where augur doesn't know the lat-long values, you can include them - see how [here](lat_longs).
+If you want to include locations where augur doesn't know the lat-long values, you can include them - see how [here](./lat_longs.html).
### Consistancy and Style
=====================================
tests/builds/various_export_settings/config/footer-description.md
=====================================
@@ -16,8 +16,6 @@
[external link](https://github.com) should open in a new tab.
-<script>alert("Do bad things")</script>
-
---
horizontal
***
@@ -41,4 +39,3 @@ Markdown image renders as a centered image:
Multiple images with one or no line breaks are centered together:
[![Markdown img](https://nextstrain.org/static/nextstrain-logo-small.ea8c3e13.png)](https://nextstrain.org)[![Markdown img](https://nextstrain.org/static/nextstrain-logo-small.ea8c3e13.png)](https://nextstrain.org)
-
=====================================
tests/functional/filter.t
=====================================
@@ -22,10 +22,22 @@ Filter with subsampling, requesting 1 sequence per group (for a group with 4 dis
> --metadata filter/metadata.tsv \
> --group-by region \
> --sequences-per-group 1 \
+ > --subsample-seed 314159 \
> --output-strains "$TMP/filtered_strains.txt" > /dev/null
$ wc -l "$TMP/filtered_strains.txt"
\s*4 .* (re)
- $ rm -f "$TMP/filtered_strains.txt"
+
+By setting the subsample seed above, we should guarantee that we get the same "random" strains as another run with the same command.
+
+ $ ${AUGUR} filter \
+ > --metadata filter/metadata.tsv \
+ > --group-by region \
+ > --sequences-per-group 1 \
+ > --subsample-seed 314159 \
+ > --output-strains "$TMP/filtered_strains_repeated.txt" > /dev/null
+
+ $ diff -u <(sort "$TMP/filtered_strains.txt") <(sort "$TMP/filtered_strains_repeated.txt")
+ $ rm -f "$TMP/filtered_strains.txt" "$TMP/filtered_strains_repeated.txt"
Filter with subsampling, requesting no more than 8 sequences.
With 8 groups to subsample from (after filtering), this should produce one sequence per group.
@@ -88,8 +100,7 @@ Explicitly use probabilistic subsampling to handle the case when there are more
> --subsample-max-sequences 5 \
> --subsample-seed 314159 \
> --probabilistic-sampling \
- > --output "$TMP/filtered.fasta" > /dev/null
- $ rm -f "$TMP/filtered.fasta"
+ > --output-strains "$TMP/filtered_strains_probabilistic.txt" > /dev/null
Using the default probabilistic subsampling, should work the same as the previous case.
@@ -101,8 +112,12 @@ Using the default probabilistic subsampling, should work the same as the previou
> --group-by country year month \
> --subsample-max-sequences 5 \
> --subsample-seed 314159 \
- > --output "$TMP/filtered.fasta" > /dev/null
- $ rm -f "$TMP/filtered.fasta"
+ > --output-strains "$TMP/filtered_strains_default.txt" > /dev/null
+
+By setting the subsample seed above, we should get the same results for both runs.
+
+ $ diff -u <(sort "$TMP/filtered_strains_probabilistic.txt") <(sort "$TMP/filtered_strains_default.txt")
+ $ rm -f "$TMP/filtered_strains_probabilistic.txt" "$TMP/filtered_strains_default.txt"
Filter using only metadata without sequence input or output and save results as filtered metadata.
View it on GitLab: https://salsa.debian.org/med-team/augur/-/commit/3288cbfb59baaf11594edb77c5e8fc93f0a112b6
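
The filter.py change above combines two ideas: every random draw comes from a generator built from the user-supplied seed, and groups are visited in sorted order so that each group always consumes the same random numbers. The following is a minimal sketch of that pattern, not the augur implementation. It uses Python's standard `random` module with a hand-written Poisson sampler in place of the `np.random.default_rng(random_seed).poisson(...)` call in the diff, and the names `poisson_draw` and `sizes_by_group` are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) sample with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def sizes_by_group(groups, max_size, max_attempts=100, random_seed=None):
    """Hypothetical stand-in for the size selection in create_queues_by_group."""
    rng = random.Random(random_seed)  # one seeded generator for every draw
    ordered = sorted(groups)  # fixed visiting order, whatever order the caller used
    sizes = {}
    total = 0
    attempts = 0
    # Retry while every group drew zero, mirroring the loop shown in the diff.
    while total == 0 and attempts < max_attempts:
        sizes = {group: poisson_draw(rng, max_size) for group in ordered}
        total = sum(sizes.values())
        attempts += 1
    return sizes


groups = ["Europe", "Asia", "Africa"]
first = sizes_by_group(groups, 0.1, random_seed=314159)
second = sizes_by_group(list(reversed(groups)), 0.1, random_seed=314159)
print(first == second)  # prints True: same seed and same sorted order
```

Without the `sorted` call, the two runs would hand the same random numbers to different groups whenever the input order differed, so the seed alone would not make the result reproducible. With `random_seed=None` the generator is seeded from system entropy and the sizes are expected to vary between runs.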