[med-svn] [Git][med-team/augur][master] 4 commits: routine-update: New upstream version

Étienne Mollier (@emollier) gitlab at salsa.debian.org
Mon Sep 4 20:46:35 BST 2023



Étienne Mollier pushed to branch master at Debian Med / augur


Commits:
a3edebb0 by Étienne Mollier at 2023-09-01T21:33:32+02:00
routine-update: New upstream version

- - - - -
d0a658be by Étienne Mollier at 2023-09-01T21:33:33+02:00
New upstream version 22.4.0
- - - - -
8600e56c by Étienne Mollier at 2023-09-01T21:34:27+02:00
Update upstream source from tag 'upstream/22.4.0'

Update to upstream version '22.4.0'
with Debian dir da42b893d50f841c32f4d482839a7c04f2933a05
- - - - -
47687198 by Étienne Mollier at 2023-09-01T21:38:08+02:00
routine-update: Ready to upload to unstable

- - - - -


18 changed files:

- .github/workflows/ci.yaml
- CHANGES.md
- augur/__main__.py
- augur/__version__.py
- augur/ancestral.py
- augur/clades.py
- augur/data/schema-annotations.json
- augur/data/schema-export-v2.json
- augur/distance.py
- augur/filter/include_exclude_rules.py
- augur/refine.py
- augur/validate.py
- debian/changelog
- tests/functional/filter/cram/filter-query-numerical.t
- tests/functional/refine/cram/timetree.t
- + tests/functional/refine/cram/timetree_with_fixed_clock_rate.t
- tests/test_validate.py
- tests/util_support/test_node_data_file.py


Changes:

=====================================
.github/workflows/ci.yaml
=====================================
@@ -80,7 +80,10 @@ jobs:
         name: coverage
         path: "${{ env.COVERAGE_FILE }}"
 
-  pathogen-ci:
+  # TODO: Use the central pathogen-repo-ci workflow¹. Currently, this is not
+  # possible because it only supports "stock" docker and conda runtimes.
+  # ¹ https://github.com/nextstrain/.github/blob/-/.github/workflows/pathogen-repo-ci.yaml
+  pathogen-repo-ci:
     runs-on: ubuntu-latest
     continue-on-error: true
     env:
@@ -97,48 +100,62 @@ jobs:
               pathogen: ncov,
               build-args: all_regions -j 2 --profile nextstrain_profiles/nextstrain-ci,
             }
+          - { pathogen: rsv }
           - {
               pathogen: seasonal-flu,
               build-args: --configfile profiles/ci/builds.yaml -p,
             }
           - { pathogen: zika }
-    name: test-pathogen-repo-ci (${{ matrix.pathogen }})
+    name: pathogen-repo-ci (${{ matrix.pathogen }})
     defaults:
       run:
         shell: bash -l {0}
     steps:
       - uses: actions/checkout@v3
-      - uses: mamba-org/provision-with-micromamba@v15
         with:
-          environment-file: false
+          path: ./augur
+
+      - uses: mamba-org/setup-micromamba@v1
+        with:
+          create-args: nextstrain-base
+          condarc: |
+            channels:
+              - nextstrain
+              - conda-forge
+              - bioconda
+            channel_priority: strict
+          cache-environment: true
           environment-name: augur
-          extra-specs: nextstrain-base
-          channels: nextstrain,conda-forge,bioconda
-          cache-env: true
-      - run: pip install .
+
+      - run: pip install ./augur
 
       - uses: actions/checkout@v3
         with:
           repository: nextstrain/${{ matrix.pathogen }}
+          path: ./pathogen-repo
+
       - name: Copy example data
+        working-directory: ./pathogen-repo
         run: |
           if [[ -d example_data ]]; then
             mkdir -p data/
-            cp -v example_data/* data/
+            cp -r -v example_data/* data/
           else
             echo No example data to copy.
           fi
-      - run: snakemake -c all ${{ matrix.build-args }}
+
+      - run: nextstrain build --ambient ./pathogen-repo ${{ matrix.build-args }}
+
       - if: always()
         uses: actions/upload-artifact@v3
         with:
           name: output-${{ matrix.pathogen }}
           path: |
-            auspice/
-            results/
-            benchmarks/
-            logs/
-            .snakemake/log/
+            ./pathogen-repo/auspice/
+            ./pathogen-repo/results/
+            ./pathogen-repo/benchmarks/
+            ./pathogen-repo/logs/
+            ./pathogen-repo/.snakemake/log/
 
   codecov:
     if: github.repository == 'nextstrain/augur'


=====================================
CHANGES.md
=====================================
@@ -3,6 +3,23 @@
 ## __NEXT__
 
 
+## 22.4.0 (29 August 2023)
+
+### Features
+
+* refine: Export covariance matrix and standard deviation for clock rate regression in the node data JSON output when these values are calculated by TreeTime. These new values appear in the `clock` data structure of the JSON output as `cov` and `rate_std` keys, respectively. [#1284][] (@huddlej)
+
+### Bug fixes
+
+* clades: Fix outputs for genes named `NA` (previously the value was replaced by `nan`). [#1293][] (@rneher)
+* distance: Improve documentation by describing how gaps get treated as indels and how users can ignore specific characters in distance calculations. [#1285][] (@huddlej)
+* Fix help output compatibility with non-Unicode streams. [#1290][] (@victorlin)
+
+[#1284]: https://github.com/nextstrain/augur/pull/1284
+[#1285]: https://github.com/nextstrain/augur/pull/1285
+[#1290]: https://github.com/nextstrain/augur/pull/1290
+[#1293]: https://github.com/nextstrain/augur/pull/1293
+
 ## 22.3.0 (14 August 2023)
 
 ### Features


=====================================
augur/__main__.py
=====================================
@@ -2,11 +2,25 @@
 Stub function and module used as a setuptools entry point.
 """
 
+import sys
 import augur
 from sys import argv, exit
 
 # Entry point for setuptools-installed script and bin/augur dev wrapper.
 def main():
+    sys.stdout.reconfigure(
+        # Support non-Unicode encodings by replacing Unicode characters instead of erroring.
+        errors="backslashreplace",
+
+        # Explicitly enable universal newlines mode so we do the right thing.
+        newline=None,
+    )
+    # Apply the above to stderr as well.
+    sys.stderr.reconfigure(
+        errors="backslashreplace",
+        newline=None,
+    )
+
     return augur.run( argv[1:] )
 
 # Run when called as `python -m augur`, here for good measure.
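The effect of the reconfigure() calls above can be reproduced without touching the real sys.stdout. The sketch below is only an illustration, not part of the commit: it wraps an in-memory buffer in an ASCII-only text stream, and the helper name write_with_fallback is hypothetical.

```python
import io


def write_with_fallback(text: str, encoding: str = "ascii") -> bytes:
    """Write text to a stream that cannot encode every character.

    Hypothetical helper: it stands in for sys.stdout on a non-Unicode
    terminal by using an in-memory byte buffer.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors="strict")

    # Same keyword arguments as the diff: unencodable characters become
    # backslash escapes instead of raising UnicodeEncodeError.
    stream.reconfigure(errors="backslashreplace", newline=None)

    stream.write(text)
    stream.flush()
    return buffer.getvalue()


if __name__ == "__main__":
    print(write_with_fallback("nodes → clade_membership"))
```

Without the reconfigure() call, the write would raise UnicodeEncodeError on the arrow character, which is the failure mode the upstream change addresses for help output.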


=====================================
augur/__version__.py
=====================================
@@ -1,4 +1,4 @@
-__version__ = '22.3.0'
+__version__ = '22.4.0'
 
 
 def is_augur_version_compatible(version):


=====================================
augur/ancestral.py
=====================================
@@ -8,6 +8,16 @@ Each node then gets assigned a list of nucleotide mutations for any position
 that has a mismatch between its own sequence and its parent's sequence.
 The node sequences and mutations are output to a node-data JSON file.
 
+If amino acid options are provided, the ancestral amino acid sequences for each
+requested gene are inferred with the same method as the nucleotide sequences described above.
+The inferred amino acid mutations will be included in the output node-data JSON
+file, with the format equivalent to the output of `augur translate`.
+
+The nucleotide and amino acid sequences are inferred separately in this command,
+which can potentially result in mismatches between the nucleotide and amino
+acid mutations. If you want amino acid mutations based on the inferred
+nucleotide sequences, please use `augur translate`.
+
 .. note::
 
     The mutation positions in the node-data JSON are one-based.


=====================================
augur/clades.py
=====================================
@@ -1,7 +1,7 @@
 """
 Assign clades to nodes in a tree based on amino-acid or nucleotide signatures.
 
-Nodes which are members of a clade are stored via 
+Nodes which are members of a clade are stored via
 <OUTPUT_NODE_DATA> → nodes → <node_name> → clade_membership
 and if this file is used in `augur export v2` these will automatically become a coloring.
 
@@ -62,7 +62,8 @@ def read_in_clade_definitions(clade_file):
     df = pd.read_csv(
         clade_file,
         sep='\t' if clade_file.endswith('.tsv') else ',',
-        comment='#'
+        comment='#',
+        na_filter=False,
     )
 
     clade_inheritance_rows = df[df['gene'] == 'clade']
@@ -83,9 +84,13 @@ def read_in_clade_definitions(clade_file):
     # Use integer 0 as root so as not to conflict with any string clade names
     # String '0' can still be used this way
     root = 0
+
+    # Skip rows that are missing a clade name.
+    defined_clades = (clade for clade in df.clade.unique() if clade != '')
+
     # For every clade, add edge from root as default
     # This way all clades can be reached by traversal
-    for clade in df.clade.unique():
+    for clade in defined_clades:
         G.add_edge(root, clade)
 
     # Build inheritance graph
@@ -181,7 +186,7 @@ def ensure_no_multiple_mutations(all_muts):
             aa_positions = [int(mut[1:-1])-1 for mut in node['aa_muts'][gene]]
             if len(set(aa_positions))!=len(aa_positions):
                 multiples.append(f"Node {name} ({gene})")
-    
+
     if multiples:
         raise AugurError(f"Multiple mutations at the same position on a single branch were found: {', '.join(multiples)}")
 
@@ -310,7 +315,7 @@ def get_reference_sequence_from_root_node(all_muts, root_name):
         except KeyError:
             missing.append(gene)
 
-    if missing:            
+    if missing:
         print(f"WARNING in augur.clades: sequences at the root node have not been specified for {{{', '.join(missing)}}}, \
 even though mutations were observed. Clades which are annotated using bases/codons present at the root \
 of the tree may not be correctly inferred.")
@@ -358,7 +363,6 @@ def run(args):
         ref = get_reference_sequence_from_root_node(all_muts, tree.root.name)
 
     clade_designations = read_in_clade_definitions(args.clades)
-
     membership, labels = assign_clades(clade_designations, all_muts, tree, ref)
     warn_if_clades_not_found(membership, clade_designations)
 


=====================================
augur/data/schema-annotations.json
=====================================
@@ -1,34 +1,90 @@
 {
     "type" : "object",
     "$schema": "http://json-schema.org/draft-06/schema#",
-    "title": "JSON object for the `annotations` key, typically produced by `augur translate`",
-    "description": "Coordinates etc of genes / genome",
+    "$id": "https://nextstrain.org/schemas/augur/annotations",
+    "title": "Schema for the 'annotations' property (node-data JSON) or the 'genome_annotations' property (auspice JSON)",
+    "properties": {
+        "nuc": {
+            "type": "object",
+            "allOf": [{ "$ref": "#/$defs/startend" }],
+            "properties": {
+                "start": {
+                    "enum": [1],
+                    "$comment": "nuc must begin at 1"
+                },
+                "strand": {
+                    "type": "string",
+                    "enum":["+"],
+                    "description": "Strand is optional for nuc, as it should be +ve for all genomes (-ve strand genomes are reverse complemented)",
+                    "$comment": "Auspice will not proceed if the JSON has strand='-'"
+                }
+            },
+            "additionalProperties": true,
+            "$comment": "All other properties are unused by Auspice."
+        }
+    },
+    "required": ["nuc"],
     "patternProperties": {
-        "^[a-zA-Z0-9*_-]+$": {
+        "^(?!nuc)[a-zA-Z0-9*_-]+$": {
+            "$comment": "Each object here defines a single CDS",
             "type": "object",
+            "oneOf": [{ "$ref": "#/$defs/startend" }, { "$ref": "#/$defs/segments" }],
+            "additionalProperties": true,
+            "required": ["strand"],
             "properties": {
-                "seqid":{
-                    "description": "Sequence on which the coordinates below are valid. Could be viral segment, bacterial contig, etc",
-                    "$comment": "Unused by Auspice 2.0",
-                    "type": "string"
+                "gene": {
+                    "type": "string",
+                    "description": "The name of the gene the CDS is from. Optional.",
+                    "$comment": "Shown in on-hover infobox & influences default CDS colors"
                 },
-                "type": {
-                    "description": "Type of the feature. could be mRNA, CDS, or similar",
-                    "$comment": "Unused by Auspice 2.0",
-                    "type": "string"
+                "strand": {
+                    "description": "Strand of the CDS",
+                    "type": "string",
+                    "enum": ["-", "+"]
                 },
-                "start": {
-                    "description": "Gene start position (one-based, following GFF format)",
-                    "type": "number"
+                "color": {
+                    "type": "string",
+                    "description": "A CSS color or a color hex code. Optional."
                 },
-                "end": {
-                    "description": "Gene end position (one-based closed, last position of feature, following GFF format)",
-                    "type": "number"
+                "display_name": {
+                    "type": "string",
+                    "$comment": "Shown in the on-hover info box"
                 },
-                "strand": {
-                    "description": "Positive or negative strand",
+                "description": {
                     "type": "string",
-                    "enum": ["-","+"]
+                    "$comment": "Shown in the on-hover info box"
+                }
+            }
+        }
+    },
+    "$defs": {
+        "startend": {
+            "type": "object",
+            "required": ["start", "end"],
+            "properties": {
+                "start": {
+                    "type": "integer",
+                    "minimum": 1,
+                    "description": "Start position (one-based, following GFF format)"
+                },
+                "end": {
+                    "type": "integer",
+                    "minimum": 2,
+                    "description": "End position (one-based, following GFF format). This value _must_ be greater than the start."
+                }
+            }
+        },
+        "segments": {
+            "type": "object",
+            "required": ["segments"],
+            "properties": {
+                "segments": {
+                    "type": "array",
+                    "minItems": 1,
+                    "items": {
+                        "type": "object",
+                        "allOf": [{ "$ref": "#/$defs/startend" }]
+                    }
                 }
             }
         }
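The new patternProperties key uses a negative lookahead to keep "nuc" out of the CDS definitions. A quick way to see which keys it accepts is to try the same regular expression in Python, as sketched below. The helper name is_cds_key is hypothetical, and the sketch assumes the validator applies the pattern with search semantics, as JSON Schema specifies.

```python
import re

# Pattern copied verbatim from the patternProperties key in the schema.
CDS_KEY_PATTERN = re.compile(r"^(?!nuc)[a-zA-Z0-9*_-]+$")


def is_cds_key(key: str) -> bool:
    """Return True if the key would be validated as a CDS entry."""
    return CDS_KEY_PATTERN.search(key) is not None


if __name__ == "__main__":
    for key in ["nuc", "ORF1a", "HA*", "nucleocapsid", "bad key"]:
        print(f"{key!r}: {is_cds_key(key)}")
```

Because the lookahead is not anchored to the end of the string, any key that merely begins with "nuc" (for example "nucleocapsid") also falls outside the CDS pattern. The schema sets no top-level additionalProperties restriction, so such a key would presumably pass validation without being checked as a CDS.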


=====================================
augur/data/schema-export-v2.json
=====================================
@@ -51,44 +51,7 @@
                     }
                 },
                 "genome_annotations": {
-                    "description": "Genome annotations (e.g. genes), relative to the reference genome",
-                    "$comment": "Required for the entropy panel",
-                    "type": "object",
-                    "required": ["nuc"],
-                    "additionalProperties": false,
-                    "properties": {
-                        "nuc": {
-                            "type": "object",
-                            "properties": {
-                                "seqid":{
-                                    "description": "Sequence on which the coordinates below are valid. Could be viral segment, bacterial contig, etc",
-                                    "$comment": "currently unused by Auspice",
-                                    "type": "string"
-                                },
-                                "type": {
-                                    "description": "Type of the feature. could be mRNA, CDS, or similar",
-                                    "$comment": "currently unused by Auspice",
-                                    "type": "string"
-                                },
-                                "start": {
-                                    "description": "Gene start position (one-based, following GFF format)",
-                                    "type": "number"
-                                },
-                                "end": {
-                                    "description": "Gene end position (one-based closed, last position of feature, following GFF format)",
-                                    "type": "number"
-                                },
-                                "strand": {
-                                    "description": "Positive or negative strand",
-                                    "type": "string",
-                                    "enum": ["-","+"]
-                                }
-                            }
-                        }
-                    },
-                    "patternProperties": {
-                        "^[a-zA-Z0-9*_-]+$": {"$ref": "#/properties/meta/properties/genome_annotations/properties/nuc"}
-                    }
+                    "$ref": "https://nextstrain.org/schemas/augur/annotations"
                 },
                 "filters": {
                     "description": "These appear as filters in the footer of Auspice (which populates the displayed values based upon the tree)",


=====================================
augur/distance.py
=====================================
@@ -31,6 +31,12 @@ tips sampled from previous seasons prior to the given date. These two date
 parameters allow users to specify a fixed time interval for pairwise
 calculations, limiting the computationally complexity of the comparisons.
 
+For all distance calculations, a consecutive series of gap characters (`-`)
+counts as a single difference between any pair of sequences. This behavior
+reflects the assumption that there was an underlying biological process that
+produced the insertion or deletion as a single event as opposed to multiple
+independent insertion/deletion events.
+
 **Distance maps**
 
 Distance maps are defined in JSON format with two required top-level keys.
@@ -47,6 +53,19 @@ The simplest possible distance map calculates Hamming distance between sequences
         "map": {}
     }
 
+To ignore specific characters such as gaps or ambiguous nucleotides from the
+distance calculation, define a top-level `ignored_characters` key with a list of
+characters to ignore.
+
+.. code-block:: json
+
+    {
+        "name": "Hamming distance",
+        "default": 1,
+        "ignored_characters": ["-", "N"],
+        "map": {}
+    }
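The two documented behaviours, gap runs counting as one difference and ignored_characters being skipped, can be sketched in a few lines of plain Python. This is an illustration of the described semantics under stated assumptions, not augur's implementation: the function name gap_aware_distance is hypothetical, every mismatch is given the default weight of 1, and an ignored position is assumed to end any gap run in progress.

```python
def gap_aware_distance(seq_a, seq_b, ignored_characters=()):
    """Count differences between two aligned sequences.

    A consecutive run of positions involving a gap ("-") counts once.
    Positions where either sequence has an ignored character are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    ignored = set(ignored_characters)
    distance = 0
    in_gap_run = False

    for a, b in zip(seq_a, seq_b):
        if a in ignored or b in ignored or a == b:
            # Assumption: matches and ignored positions end a gap run.
            in_gap_run = False
            continue

        is_gap = a == "-" or b == "-"
        if is_gap and in_gap_run:
            # Still inside the same insertion/deletion event.
            continue

        distance += 1
        in_gap_run = is_gap

    return distance


if __name__ == "__main__":
    print(gap_aware_distance("ACGTACGT", "AC---CGT"))
    print(gap_aware_distance("ACNT-", "ACGTA", ignored_characters=["-", "N"]))
```

As a simplification, the sketch does not distinguish which sequence carries the gap, so adjacent gaps in opposite sequences are merged into one event.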
+
 By default, distances are floating point values whose precision can be controlled with the `precision` key that defines the number of decimal places to retain for each distance.
 The following example shows how to specify a precision of two decimal places in the final output:
 


=====================================
augur/filter/include_exclude_rules.py
=====================================
@@ -187,7 +187,20 @@ def filter_by_query(metadata, query) -> FilterFunctionReturn:
     # Create a copy to prevent modification of the original DataFrame.
     metadata_copy = metadata.copy()
 
-    # Try converting all columns to numeric.
+    # Support numeric comparisons in query strings.
+    #
+    # The built-in data type inference when loading the DataFrame does not
+    # support nullable numeric columns, so numeric comparisons won't work on
+    # those columns. pd.to_numeric does proper conversion on those columns, and
+    # will not make any changes to columns with other values.
+    #
+    # TODO: Parse the query string and apply conversion only to columns used for
+    # numeric comparison. Pandas does not expose the API used to parse the query
+    # string internally, so this is non-trivial and requires a bit of
+    # reverse-engineering. Commit 2ead5b3e3306dc1100b49eb774287496018122d9 got
+    # halfway there but had issues so it was reverted.
+    #
+    # TODO: Try boolean conversion?
     for column in metadata_copy.columns:
         metadata_copy[column] = pd.to_numeric(metadata_copy[column], errors='ignore')
 


=====================================
augur/refine.py
=====================================
@@ -259,6 +259,12 @@ def run(args):
         node_data['clock'] = {'rate': tt.date2dist.clock_rate,
                               'intercept': tt.date2dist.intercept,
                               'rtt_Tmrca': -tt.date2dist.intercept/tt.date2dist.clock_rate}
+        # Include the standard deviation of the clock rate, if the covariance
+        # matrix is available.
+        if hasattr(tt.date2dist, "cov") and tt.date2dist.cov is not None:
+            node_data["clock"]["cov"] = tt.date2dist.cov
+            node_data["clock"]["rate_std"] = np.sqrt(tt.date2dist.cov[0, 0])
+
         if args.coalescent=='skyline':
             try:
                 skyline, conf = tt.merger_model.skyline_inferred(gen=args.gen_per_year, confidence=2)
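The relationship between the new cov and rate_std keys is simply that rate_std is the square root of the first diagonal entry of the covariance matrix. The standalone sketch below mirrors the logic of the hunk above using plain lists and the math module instead of NumPy. The function name clock_summary and all numeric values are hypothetical.

```python
import math


def clock_summary(rate, intercept, cov=None):
    """Build a dict shaped like the 'clock' entry of the node data JSON.

    cov is an optional 2x2 covariance matrix (nested lists) from the
    clock rate regression; its [0][0] entry is the variance of the rate.
    """
    clock = {
        "rate": rate,
        "intercept": intercept,
        "rtt_Tmrca": -intercept / rate,
    }
    # As in the diff, the extra keys are only added when a covariance
    # matrix is available.
    if cov is not None:
        clock["cov"] = cov
        clock["rate_std"] = math.sqrt(cov[0][0])
    return clock


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(clock_summary(0.0012, -2.4, cov=[[4e-8, 0.0], [0.0, 1e-4]]))
    print(clock_summary(0.0012, -2.4))
```

The second call corresponds to the case exercised by the new fixed-clock-rate test further down, where the clock entry is expected to contain only intercept, rate and rtt_Tmrca.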


=====================================
augur/validate.py
=====================================
@@ -25,7 +25,7 @@ class ValidateError(Exception):
     pass
 
 
-def load_json_schema(path):
+def load_json_schema(path, refs=None):
     '''
     Load a JSON schema from the augur included set of schemas
     (located in augur/data)
@@ -40,7 +40,28 @@ def load_json_schema(path):
         Validator.check_schema(schema)
     except jsonschema.exceptions.SchemaError as err:
         raise ValidateError(f"Schema {path} is not a valid JSON Schema ({Validator.META_SCHEMA['$schema']}). Error: {err}")
-    return Validator(schema)
+    
+    if refs:
+        # Make the validator aware of additional schemas
+        schema_store = {k: json.loads(resource_string(__package__, os.path.join("data", v))) for k,v in refs.items()}
+        resolver = jsonschema.RefResolver.from_schema(schema,store=schema_store)
+        schema_validator = Validator(schema, resolver=resolver)
+    else:
+        schema_validator = Validator(schema)
+
+    # By default $ref URLs which we don't define in a schema_store are fetched
+    # by jsonschema. This often indicates a typo (the $ref doesn't match the key
+    # of the schema_store) or we forgot to add a local mapping for a new $ref.
+    # Either way, Augur should not be accessing the network. 
+    def resolve_remote(url):
+        # The exception type is not important as jsonschema will catch & re-raise as a RefResolutionError
+        raise Exception(f"The schema used for validation attempted to fetch the remote URL {url!r}. " +
+                        "Augur should resolve schema references to local files, please check the schema used " +
+                        "and update the appropriate schema_store as needed." )
+    schema_validator.resolver.resolve_remote = resolve_remote
+
+    return schema_validator
+
 
 def load_json(path):
     with open(path, 'rb') as fh:
@@ -163,7 +184,15 @@ def auspice_config_v2(config_json, **kwargs):
     validate(config, schema, config_json)
 
 def export_v2(main_json, **kwargs):
-    main_schema = load_json_schema("schema-export-v2.json")
+    # The main_schema uses references to other schemas, and the suggested use is
+    # to define these refs as valid URLs. Augur itself should not access schemas
+    # over the wire so we provide a mapping between URLs and filepaths here. The
+    # filepath is specified relative to ./augur/data (where all the schemas
+    # live).
+    refs = {
+        'https://nextstrain.org/schemas/augur/annotations': "schema-annotations.json"
+    }
+    main_schema = load_json_schema("schema-export-v2.json", refs)
 
     if main_json.endswith("frequencies.json") or main_json.endswith("entropy.json") or main_json.endswith("sequences.json"):
         raise ValidateError("This validation subfunction is for the main `augur export v2` JSON only.")


=====================================
debian/changelog
=====================================
@@ -1,3 +1,9 @@
+augur (22.4.0-1) unstable; urgency=medium
+
+  * New upstream version
+
+ -- Étienne Mollier <emollier@debian.org>  Fri, 01 Sep 2023 21:34:48 +0200
+
 augur (22.3.0-1) unstable; urgency=medium
 
   * New upstream version


=====================================
tests/functional/filter/cram/filter-query-numerical.t
=====================================
@@ -5,11 +5,11 @@ Setup
 Create metadata file for testing.
 
   $ cat >metadata.tsv <<~~
-  > strain	coverage
-  > SEQ_1	0.94
-  > SEQ_2	0.95
-  > SEQ_3	0.96
-  > SEQ_4	
+  > strain	coverage	category
+  > SEQ_1	0.94	A
+  > SEQ_2	0.95	B
+  > SEQ_3	0.96	C
+  > SEQ_4		
   > ~~
 
 The 'coverage' column should be query-able by numerical comparisons.
@@ -22,3 +22,14 @@ The 'coverage' column should be query-able by numerical comparisons.
   $ sort filtered_strains.txt
   SEQ_2
   SEQ_3
+
+The 'category' column will fail when used with a numerical comparison.
+
+  $ ${AUGUR} filter \
+  >  --metadata metadata.tsv \
+  >  --query "category >= 0.95" \
+  >  --output-strains filtered_strains.txt
+  ERROR: Internal Pandas error when applying query:
+  	'>=' not supported between instances of 'str' and 'float'
+  Ensure the syntax is valid per <https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query>.
+  [2]


=====================================
tests/functional/refine/cram/timetree.t
=====================================
@@ -21,3 +21,23 @@ Confirm that TreeTime trees match expected topology and branch lengths.
 
   $ python3 "$TESTDIR/../../../../scripts/diff_trees.py" "$TESTDIR/../data/tree.nwk" tree.nwk --significant-digits 2
   {}
+
+Confirm that JSON output includes details about the clock rate.
+
+  $ grep -A 15 '\"clock\"' branch_lengths.json
+    "clock": {
+      "cov": [
+        [
+          .*, (re)
+          .* (re)
+        ],
+        [
+          .*, (re)
+          .* (re)
+        ]
+      ],
+      "intercept": .*, (re)
+      "rate": .*, (re)
+      "rate_std": .*, (re)
+      "rtt_Tmrca": .* (re)
+    },


=====================================
tests/functional/refine/cram/timetree_with_fixed_clock_rate.t
=====================================
@@ -0,0 +1,29 @@
+Setup
+
+  $ source "$TESTDIR"/_setup.sh
+
+Try building a time tree with a fixed clock rate and clock std dev.
+
+  $ ${AUGUR} refine \
+  >  --tree "$TESTDIR/../data/tree_raw.nwk" \
+  >  --alignment "$TESTDIR/../data/aligned.fasta" \
+  >  --metadata "$TESTDIR/../data/metadata.tsv" \
+  >  --output-tree tree.nwk \
+  >  --output-node-data branch_lengths.json \
+  >  --timetree \
+  >  --coalescent opt \
+  >  --date-confidence \
+  >  --date-inference marginal \
+  >  --clock-rate 0.0012 \
+  >  --clock-std-dev 0.0002 \
+  >  --clock-filter-iqd 4 \
+  >  --seed 314159 &> /dev/null
+
+Confirm that JSON output does not include information about the clock rate std dev, since it was provided by the user.
+
+  $ grep -A 4 '\"clock\"' branch_lengths.json
+    "clock": {
+      "intercept": .*, (re)
+      "rate": .*, (re)
+      "rtt_Tmrca": .* (re)
+    },


=====================================
tests/test_validate.py
=====================================
@@ -4,7 +4,10 @@ import random
 from augur.validate import (
     validate_collection_config_fields,
     validate_collection_display_defaults,
-    validate_measurements_config
+    validate_measurements_config,
+    load_json_schema,
+    validate_json,
+    ValidateError
 )
 
 
@@ -88,3 +91,71 @@ class TestValidateMeasurements():
         }
         assert not validate_measurements_config(measurements)
         assert capsys.readouterr().err == "ERROR: The default collection key 'invalid_collection' does not match any of the collections' keys.\n"
+
+
+@pytest.fixture
+def genome_annotation_schema():
+    return load_json_schema("schema-annotations.json")
+
+class TestValidateGenomeAnnotations():
+    def test_negative_strand_nuc(self, capsys, genome_annotation_schema):
+        d = {"nuc": {"start": 1, "end": 200, "strand": "-"}}
+        with pytest.raises(ValidateError):
+            validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
+
+    def test_nuc_not_starting_at_one(self, capsys, genome_annotation_schema):
+        d = {"nuc": {"start": 100, "end": 200, "strand": "+"}}
+        with pytest.raises(ValidateError):
+            validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
+
+    def test_missing_nuc(self, capsys, genome_annotation_schema):
+        d = {"cds": {"start": 100, "end": 200, "strand": "+"}}
+        with pytest.raises(ValidateError):
+            validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
+
+    def test_missing_properties(self, capsys, genome_annotation_schema):
+        d = {"nuc": {"start": 1, "end": 100}, "cds": {"start": 20, "strand": "+"}}
+        with pytest.raises(ValidateError):
+            validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
+
+    def test_not_stranded_cds(self, capsys, genome_annotation_schema):
+        # Strand . is for features that are not stranded (as per GFF spec), and thus they're not CDSs
+        d = {"nuc": {"start": 1, "end": 100}, "cds": {"start": 18, "end": 20, "strand": "."}}
+        with pytest.raises(ValidateError):
+            validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
+
+    def test_negative_coordinates(self, capsys, genome_annotation_schema):
+        d = {"nuc": {"start": 1, "end": 100}, "cds": {"start": -2, "end": 10, "strand": "+"}}
+        with pytest.raises(ValidateError):
+            validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
+
+    def test_valid_genome(self, capsys, genome_annotation_schema):
+        d = {"nuc": {"start": 1, "end": 100}, "cds": {"start": 20,  "end": 28, "strand": "+"}}
+        validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
+
+    def test_valid_segmented_genome(self, capsys, genome_annotation_schema):
+        d = {"nuc": {"start": 1, "end": 100},
+             "cds": {"segments": [{"start": 20,  "end": 28}], "strand": "+"}}
+        validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
+
+    def test_invalid_segmented_genome(self, capsys, genome_annotation_schema):
+        d = {"nuc": {"start": 1, "end": 100},
+             "cds": {"segments": [{"start": 20,  "end": 28}, {"start": 27}], "strand": "+"}}
+        with pytest.raises(ValidateError):
+            validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
+
+    def test_string_coordinates(self, capsys, genome_annotation_schema):
+        d = {"nuc": {"start": 1, "end": 100},
+             "cds": {"segments": [{"start": 20,  "end": 28}, {"start": "27", "end": "29"}], "strand": "+"}}
+        with pytest.raises(ValidateError):
+            validate_json(d, genome_annotation_schema, "<test-json>")
+        capsys.readouterr() # suppress validation error printing
\ No newline at end of file


=====================================
tests/util_support/test_node_data_file.py
=====================================
@@ -38,7 +38,7 @@ class TestNodeDataFile:
         build_node_data_file(
             f"""
             {{
-                "annotations": {{ "a": {{ "start": 5 }} }},
+                "annotations": {{ "nuc": {{ "start": 1, "end": 100 }} }},
                 "generated_by": {{ "program": "augur", "version": "{__version__}" }},
                 "nodes": {{ "a": 5 }}
             }}



View it on GitLab: https://salsa.debian.org/med-team/augur/-/compare/b3ac3e3e0e1c03d328b0018585af3e6e8acfe278...4768719855138a6e6cefd7ebded3696314650f40
