[med-svn] [Git][med-team/augur][upstream] New upstream version 13.0.1

Wed Oct 6 16:29:11 BST 2021


Nilesh Patra pushed to branch upstream at Debian Med / augur


Commits:
3288cbfb by Nilesh Patra at 2021-10-06T20:52:36+05:30
New upstream version 13.0.1
- - - - -


6 changed files:

- CHANGES.md
- augur/__version__.py
- augur/filter.py
- docs/faq/metadata.md
- tests/builds/various_export_settings/config/footer-description.md
- tests/functional/filter.t


Changes:

=====================================
CHANGES.md
=====================================
@@ -3,6 +3,16 @@
 ## __NEXT__
 
 
+## 13.0.1 (1 October 2021)
+
+### Bug Fixes
+
+* docs: Fix broken link to latitude/longitude documentation. [#766][] (@victorlin)
+* filter: Fix reproducibility of subsampling by using the user-defined random seed in all random function calls and by ordering strain sets as lists prior to adding strains to group-by priority queues. [#772][] (@huddlej)
+
+[#766]: https://github.com/nextstrain/augur/pull/766
+[#772]: https://github.com/nextstrain/augur/pull/772
+
 ## 13.0.0 (17 August 2021)
 
 ### Major Changes


=====================================
augur/__version__.py
=====================================
@@ -1,4 +1,4 @@
-__version__ = '13.0.0'
+__version__ = '13.0.1'
 
 
 def is_augur_version_compatible(version):


=====================================
augur/filter.py
=====================================
@@ -1049,7 +1049,7 @@ class PriorityQueue:
             yield item
 
 
-def create_queues_by_group(groups, max_size, max_attempts=100):
+def create_queues_by_group(groups, max_size, max_attempts=100, random_seed=None):
     """Create a dictionary of priority queues per group for the given maximum size.
 
     When the maximum size is fractional, probabilistically sample the maximum
@@ -1067,24 +1067,32 @@ def create_queues_by_group(groups, max_size, max_attempts=100):
     Create queues for two groups with a fractional maximum size. Their total max
     size should still be an integer value greater than zero.
 
-    >>> queues = create_queues_by_group(groups, 0.1)
+    >>> seed = 314159
+    >>> queues = create_queues_by_group(groups, 0.1, random_seed=seed)
     >>> int(sum(queue.max_size for queue in queues.values())) > 0
     True
 
+    A subsequent run of this function with the same groups and random seed
+    should produce the same queues and queue sizes.
+
+    >>> more_queues = create_queues_by_group(groups, 0.1, random_seed=seed)
+    >>> [queue.max_size for queue in queues.values()] == [queue.max_size for queue in more_queues.values()]
+    True
+
     """
     queues_by_group = {}
     total_max_size = 0
     attempts = 0
 
     if max_size < 1.0:
-        random_generator = np.random.default_rng()
+        random_generator = np.random.default_rng(random_seed)
 
     # For small fractional maximum sizes, it is possible to randomly select
     # maximum queue sizes that all equal zero. When this happens, filtering
     # fails unexpectedly. We make multiple attempts to create queues with
     # maximum sizes greater than zero for at least one queue.
     while total_max_size == 0 and attempts < max_attempts:
-        for group in groups:
+        for group in sorted(groups):
             if max_size < 1.0:
                 queue_max_size = random_generator.poisson(max_size)
             else:
@@ -1428,10 +1436,11 @@ def run(args):
                     if queues_by_group is None:
                         queues_by_group = {}
 
-                    for strain, group in group_by_strain.items():
+                    for strain in sorted(group_by_strain.keys()):
                         # During this first pass, we do not know all possible
                         # groups will be, so we need to build each group's queue
                         # as we first encounter the group.
+                        group = group_by_strain[strain]
                         if group not in queues_by_group:
                             queues_by_group[group] = PriorityQueue(
                                 max_size=sequences_per_group,
@@ -1501,6 +1510,7 @@ def run(args):
             queues_by_group = create_queues_by_group(
                 records_per_group.keys(),
                 sequences_per_group,
+                random_seed=args.subsample_seed,
             )
 
         # Make a second pass through the metadata, only considering records that
@@ -1522,7 +1532,8 @@ def run(args):
                 group_by,
             )
 
-            for strain, group in group_by_strain.items():
+            for strain in sorted(group_by_strain.keys()):
+                group = group_by_strain[strain]
                 queues_by_group[group].add(
                     metadata.loc[strain],
                     priorities[strain],


=====================================
docs/faq/metadata.md
=====================================
@@ -37,7 +37,7 @@ Geographic locations can be broken down, for example, into `region`, `country`,
 
 It is important that these are spelled consistently.
 
-If you want to include locations where augur doesn't know the lat-long values, you can include them - see how [here](lat_longs).
+If you want to include locations where augur doesn't know the lat-long values, you can include them - see how [here](./lat_longs.html).
 
 ### Consistancy and Style
 


=====================================
tests/builds/various_export_settings/config/footer-description.md
=====================================
@@ -16,8 +16,6 @@
 
 [external link](https://github.com) should open in a new tab.
 
-<script>alert("Do bad things")</script>
-
 ---
 horizontal
 ***
@@ -41,4 +39,3 @@ Markdown image renders as a centered image:
 Multiple images with one or no line breaks are centered together:
 
 [![Markdown img](https://nextstrain.org/static/nextstrain-logo-small.ea8c3e13.png)](https://nextstrain.org)[![Markdown img](https://nextstrain.org/static/nextstrain-logo-small.ea8c3e13.png)](https://nextstrain.org)
-


=====================================
tests/functional/filter.t
=====================================
@@ -22,10 +22,22 @@ Filter with subsampling, requesting 1 sequence per group (for a group with 4 dis
   >  --metadata filter/metadata.tsv \
   >  --group-by region \
   >  --sequences-per-group 1 \
+  >  --subsample-seed 314159 \
   >  --output-strains "$TMP/filtered_strains.txt" > /dev/null
   $ wc -l "$TMP/filtered_strains.txt"
   \s*4 .* (re)
-  $ rm -f "$TMP/filtered_strains.txt"
+
+By setting the subsample seed above, we should guarantee that we get the same "random" strains as another run with the same command.
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/metadata.tsv \
+  >  --group-by region \
+  >  --sequences-per-group 1 \
+  >  --subsample-seed 314159 \
+  >  --output-strains "$TMP/filtered_strains_repeated.txt" > /dev/null
+
+  $ diff -u <(sort "$TMP/filtered_strains.txt") <(sort "$TMP/filtered_strains_repeated.txt")
+  $ rm -f "$TMP/filtered_strains.txt" "$TMP/filtered_strains_repeated.txt"
 
 Filter with subsampling, requesting no more than 8 sequences.
 With 8 groups to subsample from (after filtering), this should produce one sequence per group.
@@ -88,8 +100,7 @@ Explicitly use probabilistic subsampling to handle the case when there are more
   >  --subsample-max-sequences 5 \
   >  --subsample-seed 314159 \
   >  --probabilistic-sampling \
-  >  --output "$TMP/filtered.fasta" > /dev/null
-  $ rm -f "$TMP/filtered.fasta"
+  >  --output-strains "$TMP/filtered_strains_probabilistic.txt" > /dev/null
 
 Using the default probabilistic subsampling, should work the same as the previous case.
 
@@ -101,8 +112,12 @@ Using the default probabilistic subsampling, should work the same as the previou
   >  --group-by country year month \
   >  --subsample-max-sequences 5 \
   >  --subsample-seed 314159 \
-  >  --output "$TMP/filtered.fasta" > /dev/null
-  $ rm -f "$TMP/filtered.fasta"
+  >  --output-strains "$TMP/filtered_strains_default.txt" > /dev/null
+
+By setting the subsample seed above, we should get the same results for both runs.
+
+  $ diff -u <(sort "$TMP/filtered_strains_probabilistic.txt") <(sort "$TMP/filtered_strains_default.txt")
+  $ rm -f "$TMP/filtered_strains_probabilistic.txt" "$TMP/filtered_strains_default.txt"
 
 Filter using only metadata without sequence input or output and save results as filtered metadata.
 



View it on GitLab: https://salsa.debian.org/med-team/augur/-/commit/3288cbfb59baaf11594edb77c5e8fc93f0a112b6
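
The augur/filter.py change above combines two ideas: every random call takes the user-supplied seed, and strains and groups are visited in sorted order, so random numbers are always consumed in the same sequence. A minimal sketch of that pattern follows. It uses the standard-library `random` module rather than the NumPy generator in the diff, and the function names `assign_priorities` and `subsample` are hypothetical, not part of augur.

```python
import random


def assign_priorities(strains, seed=None):
    """Give each strain a random priority, reproducibly for a given seed.

    Hypothetical helper for illustration; not augur's API.
    """
    rng = random.Random(seed)
    # Sorting fixes the order in which random numbers are drawn, so the
    # result does not depend on set or dict iteration order.
    return {strain: rng.random() for strain in sorted(strains)}


def subsample(strains, n, seed=None):
    """Return the n strains with the highest random priority."""
    priorities = assign_priorities(strains, seed)
    # The dict preserves the sorted insertion order, so ties (unlikely
    # with floats) are also broken deterministically.
    return sorted(priorities, key=priorities.get, reverse=True)[:n]


if __name__ == "__main__":
    strains = {"strain_a", "strain_b", "strain_c", "strain_d"}
    first = subsample(strains, 2, seed=314159)
    # Same strains in a different input order, same seed.
    second = subsample(sorted(strains, reverse=True), 2, seed=314159)
    print(first == second)
```

Without the `sorted` call, two runs with the same seed could still pair the same random draws with different strains whenever the input order differs, which is the kind of irreproducibility the changelog entry describes.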


