[med-svn] [Git][med-team/python-treetime][upstream] New upstream version 0.9.4

Andreas Tille (@tille) gitlab at salsa.debian.org
Tue Oct 11 20:15:01 BST 2022



Andreas Tille pushed to branch upstream at Debian Med / python-treetime


Commits:
83154f17 by Andreas Tille at 2022-10-11T21:02:52+02:00
New upstream version 0.9.4
- - - - -


14 changed files:

- README.md
- changelog.md
- docs/source/APIdoc.rst
- docs/source/installation.rst
- treetime/__init__.py
- treetime/argument_parser.py
- treetime/clock_tree.py
- treetime/distribution.py
- treetime/gtr.py
- treetime/treeanc.py
- treetime/treeregression.py
- treetime/treetime.py
- treetime/utils.py
- treetime/wrappers.py


Changes:

=====================================
README.md
=====================================
@@ -8,7 +8,7 @@
 
 TreeTime provides routines for ancestral sequence reconstruction and inference of molecular-clock phylogenies, i.e., a tree where all branches are scaled such that the positions of terminal nodes correspond to their sampling times and internal nodes are placed at the most likely time of divergence.
 
-To optimize the likelihood of time-scaled phylogenies, TreeTime uses an iterative approach that first infers ancestral sequences given the branch length of the tree, then optimizes the positions of unconstrained nodes on the time axis, and then repeats this cycle.
+To optimize the likelihood of time-scaled phylogenies, TreeTime uses an iterative approach that first infers ancestral sequences given the branch length of the tree (assuming the branch length of the input tree is in units of average number of nucleotide or protein substitutions per site in the sequence). After infering ancestral sequences TreeTime optimizes the positions of unconstrained nodes on the time axis, and then repeats this cycle.
 The only topology optimization are (optional) resolution of polytomies in a way that is most (approximately) consistent with the sampling time constraints on the tree.
 The package is designed to be used as a stand-alone tool on the command-line or as a library used in larger phylogenetic analysis work-flows.
 [The documentation of TreeTime is hosted on readthedocs.org](https://treetime.readthedocs.io/en/latest/).


=====================================
changelog.md
=====================================
@@ -1,3 +1,18 @@
+# 0.9.4: bug fix and performance improvments
+
+ * avoid negative variance associated with branch lengths in tree regression. This could happen in rare cases when marginal time tree estimation returned short negative branch length and the variance was estimated as being proportional to branch length. Variances in the `TreeRegression` clock model are now always non-negative.
+ * downsample the grid during multiplication of distribution objects. This turned out to be an issue for trees with very large polytomies. In these cases, a large number of distributions get multiplied which resulted in grid sizes above 100000 points. Grid sizes are now downsampled to the average grid size.
+
+# 0.9.3
+
+ * Add extra error class for "unknown" (==unhandled) errors
+ * Wrap `run` function and have it optionally raise unhandled exceptions as `TreeTimeUnknownError`.
+   This is mainly done to improve interaction with `augur` that uses `TreeTime` internals as a library.
+   (both by @anna-parker with input from @victorlin)
+
+[PR #206](https://github.com/neherlab/treetime/pull/206)
+[PR #208](https://github.com/neherlab/treetime/pull/208)
+
 # 0.9.2
 bug fix release:
  * CLI now works for windows (thanks @corneliusroemer for the fix)


=====================================
docs/source/APIdoc.rst
=====================================
@@ -35,7 +35,7 @@ Core classes
 ~~~~~~~~~~~~~~~~~~~~~
 
 :doc:`Coalescent class<merger_models>`
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 
 


=====================================
docs/source/installation.rst
=====================================
@@ -54,7 +54,7 @@ The API documentation for the TreeTime package is generated created with Sphinx.
 
   .. code:: bash
 
-	pip install recommonmark sphinx-argparse
+	pip install recommonmark sphinx-argparse sphinx_rtd_theme
 
 After required packages are installed, navigate to doc directory, and build the docs by typing:
 


=====================================
treetime/__init__.py
=====================================
@@ -1,7 +1,15 @@
-version="0.9.2"
-
+version="0.9.4"
+## Here we define an error class for TreeTime errors, MissingData, UnknownMethod and NotReady errors
+## are all due to incorrect calling of TreeTime functions or input data that does not fit our base assumptions.
+## Errors marked as TreeTimeUnknownErrors might be due to data not fulfilling base assumptions or due
+## to bugs in TreeTime. Please report them to the developers if they persist.
 class TreeTimeError(Exception):
-    """TreeTimeError class"""
+    """
+    TreeTimeError class
+    Parent class for more specific errors
+    Raised when treetime is used incorrectly in contrast with `TreeTimeUnknownError`
+    `TreeTimeUnknownError` is raised when the reason of the error is unknown, could indicate bug
+    """
     pass
 
 class MissingDataError(TreeTimeError):
@@ -9,13 +17,17 @@ class MissingDataError(TreeTimeError):
     pass
 
 class UnknownMethodError(TreeTimeError):
-    """MissingDataError class raised when tree or alignment are missing"""
+    """MissingDataError class raised when an unknown method is called"""
     pass
 
 class NotReadyError(TreeTimeError):
     """NotReadyError class raised when results are requested before inference"""
     pass
 
+class TreeTimeUnknownError(Exception):
+    """TreeTimeUnknownError class raised when TreeTime fails during inference due to an unknown reason. This might be due to data not fulfilling base assumptions or due  to bugs in TreeTime. Please report them to the developers if they persist."""
+    pass
+
 
 from .treeanc import TreeAnc
 from .treetime import TreeTime, plot_vs_years


=====================================
treetime/argument_parser.py
=====================================
@@ -79,10 +79,11 @@ reroot_description = "Reroot the tree using root-to-tip regression. Valid choice
     "By default, TreeTime will reroot using 'least-squares'. "\
     "Use --keep-root to keep the current root."
 
-tree_description = "Name of file containing the tree in "\
-    "newick, nexus, or phylip format. If none is provided, "\
-    "treetime will attempt to build a tree from the alignment "\
-    "using fasttree, iqtree, or raxml (assuming they are installed)"
+tree_description = "Name of file containing the tree in newick, nexus, or phylip format, "\
+    "the branch length of the tree should be in units of average number of nucleotide or protein "\
+    "substitutions per site. If no file is provided, treetime will attempt "\
+    "to build a tree from the alignment using fasttree, iqtree, or raxml "\
+    "(assuming they are installed). "
 
 aln_description = "alignment file (fasta)"
 


=====================================
treetime/clock_tree.py
=====================================
@@ -1,6 +1,6 @@
 import numpy as np
 from . import config as ttconf
-from . import MissingDataError
+from . import MissingDataError, UnknownMethodError
 from .treeanc import TreeAnc
 from .utils import numeric_date, DateConversion, datestring_from_numeric
 from .distribution import Distribution
@@ -77,7 +77,7 @@ class ClockTree(TreeAnc):
         """
         super(ClockTree, self).__init__(*args, **kwargs)
         if dates is None:
-            raise ValueError("ClockTree requires date constraints!")
+            raise MissingDataError("ClockTree requires date constraints!")
 
         self.debug=debug
         self.real_dates = real_dates
@@ -203,7 +203,7 @@ class ClockTree(TreeAnc):
                         " fft_grid_points=%.3e"%(precision_fft), 2)
                 self.fft_grid_size = precision_fft
             elif precision_fft!='auto':
-                raise ValueError(f"ClockTree: precision_fft needs to be either 'auto' or an integer, got '{precision_fft}'.")
+                raise UnknownMethodError(f"ClockTree: precision_fft needs to be either 'auto' or an integer, got '{precision_fft}'.")
             else:
                 self.fft_grid_size = ttconf.FFT_FWHM_GRID_SIZE
             if type(precision_branch) is int:
@@ -244,7 +244,7 @@ class ClockTree(TreeAnc):
         branch_value = lambda x:x.mutation_length
         if covariation:
             om = self.one_mutation
-            branch_variance = lambda x:((x.clock_length if hasattr(x,'clock_length') else x.mutation_length)
+            branch_variance = lambda x:((max(0,x.clock_length) if hasattr(x,'clock_length') else x.mutation_length)
                                         +(self.tip_slack**2*om if x.is_terminal() else 0.0))*om
         else:
             branch_variance = lambda x:1.0 if x.is_terminal() else 0.0
@@ -256,6 +256,7 @@ class ClockTree(TreeAnc):
 
 
     def get_clock_model(self, covariation=True, slope=None):
+        self.logger(f'ClockTree.get_clock_model: estimating clock model with covariation={covariation}',3)
         Treg = self.setup_TreeRegression(covariation=covariation)
         self.clock_model = Treg.regression(slope=slope)
         if not np.isfinite(self.clock_model['slope']):
@@ -596,7 +597,7 @@ class ClockTree(TreeAnc):
                     pass
 
         method = 'FFT' if self.use_fft else 'explicit'
-        self.logger("ClockTree - Marginal reconstruction:  Propagating leaves -> root...", 2)
+        self.logger(f"ClockTree - Marginal reconstruction using {method} convolution:  Propagating leaves -> root...", 2)
         # go through the nodes from leaves towards the root:
         for node in self.tree.find_clades(order='postorder'):  # children first, msg to parents
             if node.bad_branch:
@@ -640,7 +641,7 @@ class ClockTree(TreeAnc):
                     ## for k branches, but due to the fact that inner branches overlap at time t one can be removed
                     ## resulting in the exponent (k-1))
                     if hasattr(self, 'merger_model') and self.merger_model:
-                        time_points = np.unique(np.concatenate([msg.x for msg in msgs_to_multiply]))
+                        time_points = node.product_of_child_messages.x
                         # set multiplicity of node to number of good child branches
                         if node.is_terminal():
                             merger_contribution = Distribution(time_points, -self.merger_model.integral_merger_rate(time_points), is_log=True)
@@ -701,7 +702,7 @@ class ClockTree(TreeAnc):
                         complementary_msgs.append(parent.msg_from_parent)
 
                     if hasattr(self, 'merger_model') and self.merger_model:
-                        time_points = np.unique(np.concatenate([msg.x for msg in complementary_msgs]))
+                        time_points = parent.marginal_pos_LH.x
                         # As Lx do not include the node contribution this must be added on
                         complementary_msgs.append(self.merger_model.node_contribution(parent, time_points))
 


=====================================
treetime/distribution.py
=====================================
@@ -49,6 +49,10 @@ class Distribution(object):
         # 0.69... is log(2), there is always one value for which this is true since
         # the minimum is subtracted
         tmp = np.where(log_prob < 0.693147)[0]
+        if len(tmp)==0:
+            raise ValueError("Error in computing the FWHM for the distribution. This is "
+                    "most likely caused by incorrect input data.")
+
         x_l, x_u = tmp[0], tmp[-1]
         if L < 2:
             print ("Not enough points to compute FWHM: returning zero")
@@ -98,6 +102,14 @@ class Distribution(object):
 
             x_vals = np.unique(np.concatenate([k.x for k in dists]))
             x_vals = x_vals[(x_vals> new_xmin-TINY_NUMBER)&(x_vals< new_xmax+TINY_NUMBER)]
+            n_dists = len(dists)
+            if len(x_vals)>100*n_dists and n_dists>3:
+                n_bins = len(x_vals)//n_dists - 6
+                lower_cut_off = n_dists*3
+                upper_cut_off = n_dists*(n_bins + 3)
+                x_vals = np.concatenate((x_vals[:lower_cut_off],
+                                         x_vals[lower_cut_off:upper_cut_off].reshape((-1,n_dists)).mean(axis=1),
+                                         x_vals[upper_cut_off:]))
             y_vals = np.sum([k.__call__(x_vals) for k in dists], axis=0)
             try:
                 peak = y_vals.min()


=====================================
treetime/gtr.py
=====================================
@@ -1,6 +1,6 @@
 from collections import defaultdict
 import numpy as np
-from . import config as ttconf
+from . import config as ttconf, TreeTimeError, MissingDataError
 from .seq_utils import alphabets, profile_maps, alphabet_synonyms
 
 def avg_transition(W,pi, gap_index=None):
@@ -951,7 +951,12 @@ class GTR(object):
             Array of values exp(lambda(i) * t),
             where (i) - alphabet index (the eigenvalue number).
         """
-        return np.exp(self.mu * t * self.eigenvals)
+        log_val = self.mu * t * self.eigenvals
+        if any(i > 10 for i in log_val):
+            raise ValueError("Error in computing exp(Q * t): Q has positive eigenvalues or the branch length t is too large. "
+                    "This is most likely caused by incorrect input data.")
+
+        return np.exp(log_val)
 
 
     def expQt(self, t):


=====================================
treetime/treeanc.py
=====================================
@@ -63,8 +63,10 @@ class TreeAnc(object):
         Parameters
         ----------
         tree : str, Bio.Phylo.Tree
-           Phylogenetic tree. String passed is interpreted as a filename with
-           a tree in a standard format that can be parsed by the Biopython Phylo module.
+            Phylogenetic tree. String passed is interpreted as a filename with
+            a tree in a standard format that can be parsed by the Biopython Phylo module.
+            Branch length should be in units of average number of nucleotide or protein
+            substitutions per site. Use on trees with longer branches (>4) is not recommended.
 
         aln : str, Bio.Align.MultipleSequenceAlignment, dict
            Sequence alignment. If a string passed, it is interpreted as the
@@ -145,7 +147,7 @@ class TreeAnc(object):
         self._tree = None
         self.tree = tree
         if tree is None:
-            raise ValueError("TreeAnc: tree loading failed! exiting")
+            raise MissingDataError("TreeAnc: tree loading failed! exiting")
 
         # set up GTR model
         self._gtr = None
@@ -309,20 +311,28 @@ class TreeAnc(object):
                 if fmt in ['nexus', 'nex']:
                     self._tree=Phylo.read(in_tree, 'nexus')
                 else:
-                    raise ValueError('TreeAnc: could not load tree, format needs to be nexus or newick! input was '+str(in_tree))
+                    raise MissingDataError('TreeAnc: could not load tree, format needs to be nexus or newick! input was '+str(in_tree))
         else:
-            raise ValueError('TreeAnc: could not load tree! input was '+str(in_tree))
+            raise MissingDataError('TreeAnc: could not load tree! input was '+str(in_tree))
 
         if self._tree.count_terminals()<3:
-            raise ValueError('TreeAnc: tree in %s as only %d tips. Please check your tree!'%(str(in_tree), self._tree.count_terminals()))
+            raise MissingDataError('TreeAnc: tree in %s as only %d tips. Please check your tree!'%(str(in_tree), self._tree.count_terminals()))
 
         # remove all existing sequence attributes
+        branch_length_warning = False
         for node in self._tree.find_clades():
             node.branch_length = node.branch_length if node.branch_length else 0.0
+            if node.branch_length > ttconf.MAX_BRANCH_LENGTH:
+                branch_length_warning = True
             if hasattr(node, "_cseq"):
                 node.__delattr__("_cseq")
             node.original_length = node.branch_length
             node.mutation_length = node.branch_length
+        if branch_length_warning:
+            self.logger("WARNING: TreeTime has detected branches that are longer than %d. "
+                        "TreeTime requires trees where branch length is in units of average number "
+                        "of nucleotide or protein substitutions per site. "
+                        "Use on trees with longer branches is not recommended for ancestral sequence reconstruction."%(ttconf.MAX_BRANCH_LENGTH), 0, warn=True)
         self.prepare_tree()
 
         if self.data:
@@ -518,7 +528,7 @@ class TreeAnc(object):
         elif method.lower() in ['fitch', 'parsimony']:
             _ml_anc = self._fitch_anc
         else:
-            raise ValueError("Reconstruction method needs to be in ['ml', 'probabilistic', 'fitch', 'parsimony'], got '{}'".format(method))
+            raise UnknownMethodError("Reconstruction method needs to be in ['ml', 'probabilistic', 'fitch', 'parsimony'], got '{}'".format(method))
 
         if infer_gtr:
             self.infer_gtr(marginal=marginal, **kwargs)


=====================================
treetime/treeregression.py
=====================================
@@ -119,7 +119,7 @@ class TreeRegression(object):
     def Cov(self):
         """
         calculate the covariance matrix of the tips assuming variance
-        has accumulated along branches of the tree accoriding to the
+        has accumulated along branches of the tree according to the
         the provided
         Returns
         -------
@@ -347,7 +347,7 @@ class TreeRegression(object):
                     best_root.update(reg)
 
         if 'node' not in best_root:
-            print("TreeRegression.find_best_root: No valid root found!", force_positive)
+            print(f"TreeRegression.find_best_root: No valid root found! force_positive={force_positive}")
             return None
 
         if 'hessian' in best_root:


=====================================
treetime/treetime.py
=====================================
@@ -2,7 +2,7 @@ import numpy as np
 from scipy import optimize as sciopt
 from Bio import Phylo
 from . import config as ttconf
-from . import MissingDataError,UnknownMethodError,NotReadyError,TreeTimeError
+from . import MissingDataError,UnknownMethodError,NotReadyError,TreeTimeError, TreeTimeUnknownError
 from .utils import tree_layout
 from .clock_tree import ClockTree
 
@@ -25,7 +25,7 @@ def reduce_time_marginal_argument(input_time_marginal):
     elif input_time_marginal == 'confidence-only':
         return input_time_marginal
     else:
-        raise ValueError(f"'{input_time_marginal}' is not a known time marginal argument")
+        raise UnknownMethodError(f"'{input_time_marginal}' is not a known time marginal argument")
 
 
 class TreeTime(ClockTree):
@@ -51,7 +51,29 @@ class TreeTime(ClockTree):
         super(TreeTime, self).__init__(*args, **kwargs)
 
 
-    def run(self, root=None, infer_gtr=True, relaxed_clock=None, n_iqd = None,
+    def run(self, raise_uncaught_exceptions=False, **kwargs):
+        import sys
+        try:
+            return self._run(**kwargs)
+        except TreeTimeError as err:
+            if raise_uncaught_exceptions:
+                raise err
+            else:
+                print(f"ERROR: {err} \n", file=sys.stderr)
+                sys.exit(2)
+        except BaseException as err:
+            import traceback
+            print(traceback.format_exc(), file=sys.stderr)
+            print(f"ERROR: {err} \n ", file=sys.stderr)
+            print("ERROR in TreeTime.run: An error occurred which was not properly handled in TreeTime. If this error persists, please let us know "
+                    "by filing a new issue including the original command and the error above at: https://github.com/neherlab/treetime/issues \n", file=sys.stderr)
+            if raise_uncaught_exceptions:
+                raise TreeTimeUnknownError() from err
+            else:
+                sys.exit(2)
+
+
+    def _run(self, root=None, infer_gtr=True, relaxed_clock=None, n_iqd = None,
             resolve_polytomies=True, max_iter=0, Tc=None, fixed_clock_rate=None,
             time_marginal='never', sequence_marginal=False, branch_length_mode='auto',
             vary_rate=False, use_covariation=False, tracelog_file=None,
@@ -177,7 +199,7 @@ class TreeTime(ClockTree):
             time_marginal=kwargs["do_marginal"]
 
         if assign_gamma and relaxed_clock:
-            raise TreeTimeError("assign_gamma and relaxed clock are incompatible arguments")
+            raise UnknownMethodError("assign_gamma and relaxed clock are incompatible arguments")
 
         # initially, infer ancestral sequences and infer gtr model if desired
         if self.branch_length_mode=='input':


=====================================
treetime/utils.py
=====================================
@@ -2,7 +2,7 @@ import os,sys
 import datetime
 import pandas as pd
 import numpy as np
-from . import TreeTimeError
+from . import TreeTimeError, MissingDataError
 
 class DateConversion(object):
     """
@@ -263,11 +263,11 @@ def parse_dates(date_file, name_col=None, date_col=None):
                 potential_index_columns.append((ci, col))
 
         if date_col and date_col not in df.columns:
-            raise TreeTimeError("ERROR: specified column for dates does not exist. \n\tAvailable columns are: "\
+            raise MissingDataError("ERROR: specified column for dates does not exist. \n\tAvailable columns are: "\
                                 +", ".join(df.columns)+"\n\tYou specified '%s'"%date_col)
 
         if name_col and name_col not in df.columns:
-            raise TreeTimeError("ERROR: specified column for the taxon name does not exist. \n\tAvailable columns are: "\
+            raise MissingDataError("ERROR: specified column for the taxon name does not exist. \n\tAvailable columns are: "\
                                 +", ".join(df.columns)+"\n\tYou specified '%s'"%name_col)
 
 
@@ -275,7 +275,7 @@ def parse_dates(date_file, name_col=None, date_col=None):
         # if a potential numeric date column was found, use it
         # (use the first, if there are more than one)
         if not (len(potential_index_columns) or name_col):
-            raise TreeTimeError("ERROR: Cannot read metadata: need at least one column that contains the taxon labels."
+            raise MissingDataError("ERROR: Cannot read metadata: need at least one column that contains the taxon labels."
                   " Looking for the first column that contains 'name', 'strain', or 'accession' in the header.")
         else:
             # use the first column that is either 'name', 'strain', 'accession'
@@ -320,10 +320,10 @@ def parse_dates(date_file, name_col=None, date_col=None):
                             dates[k] = [numeric_date(x) for x in [lower, upper]]
 
         else:
-            raise TreeTimeError("ERROR: Metadata file has no column which looks like a sampling date!")
+            raise MissingDataError("ERROR: Metadata file has no column which looks like a sampling date!")
 
         if all(v is None for v in dates.values()):
-            raise TreeTimeError("ERROR: Cannot parse dates correctly! Check date format.")
+            raise MissingDataError("ERROR: Cannot parse dates correctly! Check date format.")
         return dates
     except TreeTimeError as err:
         raise err


=====================================
treetime/wrappers.py
=====================================
@@ -8,7 +8,7 @@ from . import TreeAnc, GTR, TreeTime
 from . import utils
 from .vcf_utils import read_vcf, write_vcf
 from .seq_utils import alphabets
-from . import TreeTimeError, MissingDataError
+from . import TreeTimeError, MissingDataError, UnknownMethodError
 from .treetime import reduce_time_marginal_argument
 
 def assure_tree(params, tmp_dir='treetime_tmp'):
@@ -794,7 +794,7 @@ def reconstruct_discrete_traits(tree, traits, missing_data='?', pc=1.0, sampling
         if len(missing_weights)>0.5*n_observed_states:
             print("More than half of discrete states missing from the weights file")
             print("Weights read from file are:", weights)
-            raise TreeTimeError("More than half of discrete states missing from the weights file")
+            raise MissingDataError("More than half of discrete states missing from the weights file")
 
     unique_states=sorted(unique_states)
     # make a map from states (excluding missing data) to characters in the alphabet
@@ -1022,7 +1022,7 @@ def estimate_clock_model(params):
         try:
             res = myTree.reroot(params.reroot,
                       force_positive=not params.allow_negative_rate)
-        except TreeTimeError as e:
+        except UnknownMethodError as e:
             print("ERROR: unknown root or rooting mechanism!")
             raise e
 
@@ -1071,7 +1071,7 @@ def estimate_clock_model(params):
                     tmp_str = ''
                 ofile.write("%s, %s, %f, %f\n"%(n.name, tmp_str, n.dist2root, clock_deviation))
             else:
-                ofile.write("%s, %f, %f, %f\n"%(n.name, d2d.numdate_from_dist2root(n.dist2root), n.dist2root, clock_deviation))
+                ofile.write("%s, %f, %f, 0.0\n"%(n.name, d2d.numdate_from_dist2root(n.dist2root), n.dist2root))
         for n in myTree.tree.get_nonterminals(order='preorder'):
             ofile.write("%s, %f, %f, 0.0\n"%(n.name, d2d.numdate_from_dist2root(n.dist2root), n.dist2root))
         print("--- wrote dates and root-to-tip distances to \n\t%s\n"%table_fname)



View it on GitLab: https://salsa.debian.org/med-team/python-treetime/-/commit/83154f17f178aa1a2708793598176597b988076f

-- 
View it on GitLab: https://salsa.debian.org/med-team/python-treetime/-/commit/83154f17f178aa1a2708793598176597b988076f
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20221011/352591ff/attachment-0001.htm>


More information about the debian-med-commit mailing list