[med-svn] [Git][med-team/python-treetime][master] 5 commits: New upstream version 0.7.5

Sun May 3 19:14:17 BST 2020


Steffen Möller pushed to branch master at Debian Med / python-treetime


Commits:
b3255387 by Steffen Moeller at 2020-05-03T20:08:48+02:00
New upstream version 0.7.5
- - - - -
f02916fa by Steffen Moeller at 2020-05-03T20:08:48+02:00
routine-update: New upstream version

- - - - -
dccc9edf by Steffen Moeller at 2020-05-03T20:08:49+02:00
Update upstream source from tag 'upstream/0.7.5'

Update to upstream version '0.7.5'
with Debian dir bccbc8a6205dec89c593ca6505202b881f20b43b
- - - - -
0046a67e by Steffen Moeller at 2020-05-03T20:08:49+02:00
routine-update: Standards-Version: 4.5.0

- - - - -
3888fb13 by Steffen Moeller at 2020-05-03T20:13:43+02:00
Preparing for upload.

- - - - -


15 changed files:

- changelog.md
- debian/changelog
- debian/control
- − debian/patches/fix_test_indent.patch
- − debian/patches/series
- treetime/__init__.py
- treetime/distribution.py
- treetime/gtr.py
- treetime/merger_models.py
- treetime/sequence_data.py
- − treetime/test_opt.py
- treetime/treeanc.py
- treetime/treetime.py
- treetime/utils.py
- treetime/wrappers.py


Changes:

=====================================
changelog.md
=====================================
@@ -1,3 +1,31 @@
+# 0.7.5 -- fix desync of peak from grid of distributions after pruning 
+
+# 0.7.4 -- bug fix in reconstruct discrete trait routine
+
+The `reconstruct_discrete_traits` wrapper function didn't handle missing data correctly (after the changed released in 0.7.2) which resulted in alphabets and weights of different lengths.
+
+
+# 0.7.3 -- bug fix in average rate calculation
+
+This release fixes a problem that surfaced when inferring GTR models from trees of very similar sequences but quite a few gaps. This resulted in mutation counts like so:
+
+A: [[ 0.  1.  8.  3.  0.]
+C:  [ 1.  0.  2.  7.  0.]
+G:  [ 9.  0.  0.  2.  0.]
+T:  [ 1. 23.  6.  0.  0.]
+-:  [46. 22. 28. 38.  0.]]
+
+As a result, the rate "to gap" is inferred quite high, while the equilibrium gap fraction is low. Since we cap the equilibrium gap fraction from below to avoid reconstruction problems when branches are very short, this resulted in an average rate that had substantial contribution from and assumed 1% equilibrum gap frequency where gaps mutate at 20times the rate as others. Since gaps are ignored in distance calculations anyway, it is more sensible to exclude these transitions from the calculation of the average rate. This is now happening in line 7 of treetime/gtr.py. The average rate is restricted to mutation substitutions from non-gap states to any state.
+
+
+# 0.7.2 -- weights in discrete trait reconstruction
+This release implements a more consistent handling of weights (fixed equilibrium frequencies) in discrete state reconstruction.
+It also fixes a number of problems in who the arguments were processed.
+TreeTime now allows
+ * unobserved discrete states
+ * uses expected time-in-tree instead of observed time-in-tree in GTR estimation when weights are fixed. The former resulted in very unstable rate estimates.
+
+
 # 0.7.0 -- restructuring
 
 ## Major changes


=====================================
debian/changelog
=====================================
@@ -1,3 +1,12 @@
+python-treetime (0.7.5-1) UNRELEASED; urgency=medium
+
+  * Team upload.
+  * New upstream version
+  * Standards-Version: 4.5.0 (routine-update)
+  * d/control: Added "Rules-Requires-Root: no"
+
+ -- Steffen Moeller <moeller at debian.org>  Sun, 03 May 2020 20:08:48 +0200
+
 python-treetime (0.7.1-1) unstable; urgency=medium
 
   * Team upload.


=====================================
debian/control
=====================================
@@ -12,10 +12,11 @@ Build-Depends: debhelper-compat (= 12),
                python3-pandas,
                python3-scipy,
                python3-setuptools
-Standards-Version: 4.4.1
+Standards-Version: 4.5.0
 Vcs-Browser: https://salsa.debian.org/med-team/python-treetime
 Vcs-Git: https://salsa.debian.org/med-team/python-treetime.git
 Homepage: https://github.com/neherlab/treetime
+Rules-Requires-Root: no
 
 Package: python3-treetime
 Architecture: all


=====================================
debian/patches/fix_test_indent.patch deleted
=====================================
@@ -1,232 +0,0 @@
-Index: python-treetime/treetime/test_opt.py
-===================================================================
---- python-treetime.orig/treetime/test_opt.py
-+++ python-treetime/treetime/test_opt.py
-@@ -1,22 +1,83 @@
--    def optimize_tree_marginal_new(self, damping=0.5):
--        L = self.data.compressed_length
--        n_states = self.gtr.alphabet.shape[0]
--        # propagate leaves --> root, set the marginal-likelihood messages
--        for node in self.tree.find_clades(order='postorder'): #leaves -> root
--            if node.up is None and len(node.clades)==2:
--                continue
--
--            profiles = [c.marginal_subtree_LH for c in node] + [node.marginal_outgroup_LH]
--            bls = [c.branch_length for c in nodes] + [node.branch_length]
--            new_bls = self.optimize_star(profiles,bls, last_is_root=node.up is None)
--
--            # regardless of what was before, set the profile to ones
-+def optimize_tree_marginal_new(self, damping=0.5):
-+    L = self.data.compressed_length
-+    n_states = self.gtr.alphabet.shape[0]
-+    # propagate leaves --> root, set the marginal-likelihood messages
-+    for node in self.tree.find_clades(order='postorder'): #leaves -> root
-+        if node.up is None and len(node.clades)==2:
-+            continue
-+
-+        profiles = [c.marginal_subtree_LH for c in node] + [node.marginal_outgroup_LH]
-+        bls = [c.branch_length for c in nodes] + [node.branch_length]
-+        new_bls = self.optimize_star(profiles,bls, last_is_root=node.up is None)
-+
-+        # regardless of what was before, set the profile to ones
-+        tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
-+        node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
-+        for ch in ci,node.clades:
-+            ch.branch_length = new_bls[ci]
-+            ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
-+                                                            ch.branch_length, return_log=True)
-+            tmp_log_subtree_LH += ch.marginal_log_Lx
-+            node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
-+
-+        node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
-+        node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
-+
-+        if node.up:
-+            node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
-+                                            node.branch_length, return_log=True) # raw prob to transfer prob up
-+            tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
-+                                             self._branch_length_to_gtr(node), return_log=False)
-+            node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
-+        else:
-+            node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
-+
-+    root=self.tree.root
-+    print(len(root.clades))
-+    if len(root.clades)==2:
-+        tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
-+        root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
-+        old_bl = root.clades[0].branch_length + root.clades[1]
-+        bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
-+                                            root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
-+                                            profiles=True, tol=1e-8)
-+        for ch in root:
-+            ch.branch_length *= ((1-damping)*old_bl + damping*bl)/old_bl
-+            ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
-+                                        ch.branch_length, return_log=True) # raw prob to transfer prob up
-+            tmp_log_subtree_LH += ch.marginal_log_Lx
-+            root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
-+
-+        root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
-+        root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
-+
-+
-+    self.total_LH_and_root_sequence(assign_sequence=False)
-+    self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
-+
-+
-+
-+
-+def optimize_tree_marginal_new2(self, n_iter_internal=2, damping=0.5):
-+    L = self.data.compressed_length
-+    n_states = self.gtr.alphabet.shape[0]
-+    # propagate leaves --> root, set the marginal-likelihood messages
-+    for node in self.tree.get_nonterminals(order='postorder'): #leaves -> root
-+        if node.up is None and len(node.clades)==2:
-+            continue
-+        # regardless of what was before, set the profile to ones
-+        for ii in range(n_iter_internal):
-+            damp = damping**(1+ii)
-             tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
-             node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
--            for ch in ci,node.clades:
--                ch.branch_length = new_bls[ci]
-+            for ch in node.clades:
-+                outgroup = np.exp(np.log(np.maximum(ttconf.TINY_NUMBER, node.marginal_profile)) - ch.marginal_log_Lx)
-+
-+                bl = self.gtr.optimal_t_compressed((ch.marginal_subtree_LH, outgroup), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
-+                new_bl = (1-damp)*bl + damp*ch.branch_length
-+                ch.branch_length=new_bl
-                 ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
--                                                                ch.branch_length, return_log=True)
-+                                            new_bl, return_log=True) # raw prob to transfer prob up
-                 tmp_log_subtree_LH += ch.marginal_log_Lx
-                 node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
- 
-@@ -24,101 +85,40 @@
-             node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
- 
-             if node.up:
-+                bl = self.gtr.optimal_t_compressed((node.marginal_subtree_LH, node.marginal_outgroup_LH), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
-+                new_bl = (1-damp)*bl + damp*node.branch_length
-+                node.branch_length=new_bl
-                 node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
--                                                node.branch_length, return_log=True) # raw prob to transfer prob up
-+                                                new_bl, return_log=True) # raw prob to transfer prob up
-+                node.marginal_outgroup_LH, pre = normalize_profile(np.log(np.maximum(ttconf.TINY_NUMBER, node.up.marginal_profile)) - node.marginal_log_Lx,
-+                                             log=True, return_offset=False)
-+
-                 tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
-                                                  self._branch_length_to_gtr(node), return_log=False)
-                 node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
-             else:
-                 node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
- 
--        root=self.tree.root
--        print(len(root.clades))
--        if len(root.clades)==2:
--            tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
--            root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
--            old_bl = root.clades[0].branch_length + root.clades[1]
--            bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
--                                                root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
--                                                profiles=True, tol=1e-8)
--            for ch in root:
--                ch.branch_length *= ((1-damping)*old_bl + damping*bl)/old_bl
--                ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
--                                            ch.branch_length, return_log=True) # raw prob to transfer prob up
--                tmp_log_subtree_LH += ch.marginal_log_Lx
--                root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
--
--            root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
--            root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
--
--
--        self.total_LH_and_root_sequence(assign_sequence=False)
--        self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
- 
-+    root=self.tree.root
-+    print(len(root.clades))
-+    if len(root.clades)==2:
-+        tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
-+        root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
-+        old_bl = root.clades[0].branch_length + root.clades[1]
-+        bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
-+                                            root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
-+                                            profiles=True, tol=1e-8)
-+        for ch in root:
-+            ch.branch_length *= bl/old_bl
-+            ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
-+                                        ch.branch_length, return_log=True) # raw prob to transfer prob up
-+            tmp_log_subtree_LH += ch.marginal_log_Lx
-+            root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
- 
-+        root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
-+        root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
- 
- 
--    def optimize_tree_marginal_new2(self, n_iter_internal=2, damping=0.5):
--        L = self.data.compressed_length
--        n_states = self.gtr.alphabet.shape[0]
--        # propagate leaves --> root, set the marginal-likelihood messages
--        for node in self.tree.get_nonterminals(order='postorder'): #leaves -> root
--            if node.up is None and len(node.clades)==2:
--                continue
--            # regardless of what was before, set the profile to ones
--            for ii in range(n_iter_internal):
--                damp = damping**(1+ii)
--                tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
--                node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
--                for ch in node.clades:
--                    outgroup = np.exp(np.log(np.maximum(ttconf.TINY_NUMBER, node.marginal_profile)) - ch.marginal_log_Lx)
--
--                    bl = self.gtr.optimal_t_compressed((ch.marginal_subtree_LH, outgroup), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
--                    new_bl = (1-damp)*bl + damp*ch.branch_length
--                    ch.branch_length=new_bl
--                    ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
--                                                new_bl, return_log=True) # raw prob to transfer prob up
--                    tmp_log_subtree_LH += ch.marginal_log_Lx
--                    node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
--
--                node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
--                node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
--
--                if node.up:
--                    bl = self.gtr.optimal_t_compressed((node.marginal_subtree_LH, node.marginal_outgroup_LH), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
--                    new_bl = (1-damp)*bl + damp*node.branch_length
--                    node.branch_length=new_bl
--                    node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
--                                                    new_bl, return_log=True) # raw prob to transfer prob up
--                    node.marginal_outgroup_LH, pre = normalize_profile(np.log(np.maximum(ttconf.TINY_NUMBER, node.up.marginal_profile)) - node.marginal_log_Lx,
--                                                 log=True, return_offset=False)
--
--                    tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
--                                                     self._branch_length_to_gtr(node), return_log=False)
--                    node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
--                else:
--                    node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
--
--
--        root=self.tree.root
--        print(len(root.clades))
--        if len(root.clades)==2:
--            tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
--            root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
--            old_bl = root.clades[0].branch_length + root.clades[1]
--            bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
--                                                root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
--                                                profiles=True, tol=1e-8)
--            for ch in root:
--                ch.branch_length *= bl/old_bl
--                ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
--                                            ch.branch_length, return_log=True) # raw prob to transfer prob up
--                tmp_log_subtree_LH += ch.marginal_log_Lx
--                root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
--
--            root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
--            root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
--
--
--        self.total_LH_and_root_sequence(assign_sequence=False)
--        self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
-+    self.total_LH_and_root_sequence(assign_sequence=False)
-+    self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)


=====================================
debian/patches/series deleted
=====================================
@@ -1 +0,0 @@
-fix_test_indent.patch


=====================================
treetime/__init__.py
=====================================
@@ -1,5 +1,5 @@
 from __future__ import print_function, division, absolute_import
-version="0.7.0"
+version="0.7.5"
 
 class TreeTimeError(Exception):
     """TreeTimeError class"""


=====================================
treetime/distribution.py
=====================================
@@ -265,6 +265,10 @@ class Distribution(object):
                 updated=False
                 n_iter+=1
 
+        self._peak_idx = self.__call__(self._func.x).argmin()
+        self._peak_pos = self._func.x[self._peak_idx]
+        self._peak_val = self.__call__(self.peak_pos)
+
 
     def prob(self,x):
         return np.exp(-1 * self.__call__(x))


=====================================
treetime/gtr.py
=====================================
@@ -4,6 +4,12 @@ import numpy as np
 from treetime import config as ttconf
 from .seq_utils import alphabets, profile_maps, alphabet_synonyms
 
+def avg_transition(W,pi, gap_index=None):
+    if gap_index is None:
+        return np.einsum('i,ij,j', pi, W, pi)
+    else:
+        return (np.einsum('i,ij,j', pi, W, pi) - np.sum(pi*W[:,gap_index])*pi[gap_index])/(1-pi[gap_index])
+
 
 class GTR(object):
     """
@@ -212,7 +218,7 @@ class GTR(object):
 
         self._W = 0.5*(W+W.T)
         np.fill_diagonal(W,0)
-        average_rate = W.dot(self.Pi).dot(self.Pi)
+        average_rate = avg_transition(W, self.Pi, gap_index=self.gap_index)
         self._W = W/average_rate
         self._mu *=average_rate
 
@@ -378,23 +384,27 @@ class GTR(object):
         from .aa_models  import JTT92
 
         if model.lower() in ['jc', 'jc69', 'jukes-cantor', 'jukes-cantor69', 'jukescantor', 'jukescantor69']:
-            return JC69(**kwargs)
+            model = JC69(**kwargs)
         elif model.lower() in ['k80', 'kimura80', 'kimura1980']:
-            return K80(**kwargs)
+            model = K80(**kwargs)
         elif model.lower() in ['f81', 'felsenstein81', 'felsenstein1981']:
-            return F81(**kwargs)
+            model = F81(**kwargs)
         elif model.lower() in ['hky', 'hky85', 'hky1985']:
-            return HKY85(**kwargs)
+            model = HKY85(**kwargs)
         elif model.lower() in ['t92', 'tamura92', 'tamura1992']:
-            return T92(**kwargs)
+            model = T92(**kwargs)
         elif model.lower() in ['tn93', 'tamura_nei_93', 'tamuranei93']:
-            return TN93(**kwargs)
+            model = TN93(**kwargs)
         elif model.lower() in ['jtt', 'jtt92']:
-            return JTT92(**kwargs)
+            model = JTT92(**kwargs)
         else:
             raise KeyError("The GTR model '{}' is not in the list of available models."
                 "".format(model))
 
+        model.mu = kwargs['mu'] if 'mu' in kwargs else 1.0
+        return model
+
+
     @classmethod
     def random(cls, mu=1.0, alphabet='nuc'):
         """
@@ -494,13 +504,16 @@ class GTR(object):
                                                     + ttconf.TINY_NUMBER + 2*pc_mat)
 
             np.fill_diagonal(W_ij, 0)
-            scale_factor = np.einsum('i,ij,j',pi,W_ij,pi)
+            scale_factor = avg_transition(W_ij,pi, gap_index=gtr.gap_index)
 
             W_ij = W_ij/scale_factor
             if fixed_pi is None:
                 pi = (np.sum(nij+pc_mat,axis=1)+root_state)/(ttconf.TINY_NUMBER + mu*np.dot(W_ij,Ti)+root_state.sum()+np.sum(pc_mat, axis=1))
                 pi /= pi.sum()
-            mu = nij.sum()/(ttconf.TINY_NUMBER + np.sum(pi * (W_ij.dot(Ti))))
+                mu = nij.sum()/(ttconf.TINY_NUMBER + np.sum(pi * (W_ij.dot(Ti))))
+            else:
+                mu = nij.sum()/(ttconf.TINY_NUMBER + np.sum(pi * (W_ij.dot(pi)))*Ti.sum())
+
         if count >= Nit:
             gtr.logger('WARNING: maximum number of iterations has been reached in GTR inference',3, warn=True)
             if LA.norm(pi_old-pi) > dp:
@@ -999,7 +1012,7 @@ class GTR(object):
                       for si,state in enumerate(self.alphabet)])
 
     def average_rate(self):
-        return -self.mu*np.einsum('ii,i',self.Q, self.Pi)
+        return self.mu*avg_transition(self.W, self.Pi, gap_index=self.gap_index)
 
     def save_to_npz(self, outfile):
         full_gtr = self.mu * np.dot(self.Pi, self.W)


=====================================
treetime/merger_models.py
=====================================
@@ -33,12 +33,14 @@ class Coalescent(object):
         Args:
             - Tc:   a float or an iterable, if iterable another argument T of same shape is required
             - T:    an array like of same shape as Tc that specifies the time pivots corresponding to Tc
+                    note that this array is ordered past to present corresponding to
+                    decreasing 'time before present' values
         Returns:
             - None
         '''
         if isinstance(Tc, Iterable):
             if len(Tc)==len(T):
-                x = np.concatenate(([-ttconf.BIG_NUMBER], T, [ttconf.BIG_NUMBER]))
+                x = np.concatenate(([ttconf.BIG_NUMBER], T, [-ttconf.BIG_NUMBER]))
                 y = np.concatenate(([Tc[0]], Tc, [Tc[-1]]))
                 self.Tc = interp1d(x,y)
             else:


=====================================
treetime/sequence_data.py
=====================================
@@ -161,11 +161,11 @@ class SequenceData(object):
 
         if type(in_aln) is MultipleSeqAlignment:
             # check whether the alignment is consistent with a nucleotide alignment.
-            self.check_alphabet([seq2array(s) for s in in_aln])
-            self.is_sparse = False
             self._aln = {s.name: seq2array(s, convert_upper=self.convert_upper,
                                            fill_overhangs=self.fill_overhangs, ambiguous=self.ambiguous)
                          for s in in_aln}
+            self.check_alphabet(list(self._aln.values()))
+            self.is_sparse = False
             self.logger("SequenceData: loaded alignment.",1)
         elif type(in_aln) in [dict, defaultdict]:
             self.logger("SequenceData: loaded sparse/vcf alignment.",1)


=====================================
treetime/test_opt.py deleted
=====================================
@@ -1,124 +0,0 @@
-    def optimize_tree_marginal_new(self, damping=0.5):
-        L = self.data.compressed_length
-        n_states = self.gtr.alphabet.shape[0]
-        # propagate leaves --> root, set the marginal-likelihood messages
-        for node in self.tree.find_clades(order='postorder'): #leaves -> root
-            if node.up is None and len(node.clades)==2:
-                continue
-
-            profiles = [c.marginal_subtree_LH for c in node] + [node.marginal_outgroup_LH]
-            bls = [c.branch_length for c in nodes] + [node.branch_length]
-            new_bls = self.optimize_star(profiles,bls, last_is_root=node.up is None)
-
-            # regardless of what was before, set the profile to ones
-            tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
-            node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
-            for ch in ci,node.clades:
-                ch.branch_length = new_bls[ci]
-                ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
-                                                                ch.branch_length, return_log=True)
-                tmp_log_subtree_LH += ch.marginal_log_Lx
-                node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
-
-            node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
-            node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
-
-            if node.up:
-                node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
-                                                node.branch_length, return_log=True) # raw prob to transfer prob up
-                tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
-                                                 self._branch_length_to_gtr(node), return_log=False)
-                node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
-            else:
-                node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
-
-        root=self.tree.root
-        print(len(root.clades))
-        if len(root.clades)==2:
-            tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
-            root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
-            old_bl = root.clades[0].branch_length + root.clades[1]
-            bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
-                                                root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
-                                                profiles=True, tol=1e-8)
-            for ch in root:
-                ch.branch_length *= ((1-damping)*old_bl + damping*bl)/old_bl
-                ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
-                                            ch.branch_length, return_log=True) # raw prob to transfer prob up
-                tmp_log_subtree_LH += ch.marginal_log_Lx
-                root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
-
-            root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
-            root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
-
-
-        self.total_LH_and_root_sequence(assign_sequence=False)
-        self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
-
-
-
-
-    def optimize_tree_marginal_new2(self, n_iter_internal=2, damping=0.5):
-        L = self.data.compressed_length
-        n_states = self.gtr.alphabet.shape[0]
-        # propagate leaves --> root, set the marginal-likelihood messages
-        for node in self.tree.get_nonterminals(order='postorder'): #leaves -> root
-            if node.up is None and len(node.clades)==2:
-                continue
-            # regardless of what was before, set the profile to ones
-            for ii in range(n_iter_internal):
-                damp = damping**(1+ii)
-                tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
-                node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
-                for ch in node.clades:
-                    outgroup = np.exp(np.log(np.maximum(ttconf.TINY_NUMBER, node.marginal_profile)) - ch.marginal_log_Lx)
-
-                    bl = self.gtr.optimal_t_compressed((ch.marginal_subtree_LH, outgroup), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
-                    new_bl = (1-damp)*bl + damp*ch.branch_length
-                    ch.branch_length=new_bl
-                    ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
-                                                new_bl, return_log=True) # raw prob to transfer prob up
-                    tmp_log_subtree_LH += ch.marginal_log_Lx
-                    node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
-
-                node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
-                node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
-
-                if node.up:
-                    bl = self.gtr.optimal_t_compressed((node.marginal_subtree_LH, node.marginal_outgroup_LH), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
-                    new_bl = (1-damp)*bl + damp*node.branch_length
-                    node.branch_length=new_bl
-                    node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
-                                                    new_bl, return_log=True) # raw prob to transfer prob up
-                    node.marginal_outgroup_LH, pre = normalize_profile(np.log(np.maximum(ttconf.TINY_NUMBER, node.up.marginal_profile)) - node.marginal_log_Lx,
-                                                 log=True, return_offset=False)
-
-                    tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
-                                                     self._branch_length_to_gtr(node), return_log=False)
-                    node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
-                else:
-                    node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
-
-
-        root=self.tree.root
-        print(len(root.clades))
-        if len(root.clades)==2:
-            tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
-            root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
-            old_bl = root.clades[0].branch_length + root.clades[1]
-            bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
-                                                root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
-                                                profiles=True, tol=1e-8)
-            for ch in root:
-                ch.branch_length *= bl/old_bl
-                ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
-                                            ch.branch_length, return_log=True) # raw prob to transfer prob up
-                tmp_log_subtree_LH += ch.marginal_log_Lx
-                root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
-
-            root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
-            root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
-
-
-        self.total_LH_and_root_sequence(assign_sequence=False)
-        self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)


=====================================
treetime/treeanc.py
=====================================
@@ -312,6 +312,8 @@ class TreeAnc(object):
         else:
             raise ValueError('TreeAnc: could not load tree! input was '+str(in_tree))
 
+        if self._tree.count_terminals()<3:
+            raise ValueError('TreeAnc: tree in %s as only %d tips. Please check your tree!'%(str(in_tree), self._tree.count_terminals()))
 
         # remove all existing sequence attributes
         for node in self._tree.find_clades():
@@ -376,8 +378,8 @@ class TreeAnc(object):
                 self.logger("***WARNING: TreeAnc._attach_sequences_to_nodes: NO SEQUENCE FOR LEAF: %s" % l.name, 0, warn=True)
                 failed_leaves += 1
                 if failed_leaves > self.tree.count_terminals()/3:
-                    raise MissingDataError("TreeAnc._check_alignment_tree_gtr_consistency: At least 30\\% terminal nodes cannot be assigned a sequence!\n"
-                                           "Are you sure the alignment belongs to the tree?", 2, warn=True)
+                    raise MissingDataError("TreeAnc._check_alignment_tree_gtr_consistency: At least 30\% terminal nodes cannot be assigned a sequence!\n"
+                                           "Are you sure the alignment belongs to the tree?")
             else: # could not assign sequence for internal node - is OK
                 pass
 
@@ -1284,11 +1286,15 @@ class TreeAnc(object):
 
         """
         if branch_length_mode=='marginal':
-            return self.optimize_tree_marginal(max_iter=max_iter, infer_gtr=infer_gtr, pc=pc, **kwargs)
+            self.optimize_tree_marginal(max_iter=max_iter, infer_gtr=infer_gtr, pc=pc, **kwargs)
+            if prune_short:
+                self.prune_short_branches()
         elif branch_length_mode=='input':
             N_diff = self.reconstruct_anc(method='probabilistic', infer_gtr=infer_gtr, pc=pc,
                                           marginal=marginal_sequences, **kwargs)
-            return ttconf.success
+            if prune_short:
+                self.prune_short_branches()
+            return ttconf.SUCCESS
         elif branch_length_mode!='joint':
             raise UnknownMethodError("TreeAnc.optimize_tree: `branch_length_mode` should be in ['marginal', 'joint', 'input']")
 


=====================================
treetime/treetime.py
=====================================
@@ -441,7 +441,7 @@ class TreeTime(ClockTree):
             if isinstance(root,Phylo.BaseTree.Clade):
                 new_root = root
             elif isinstance(root, list):
-                new_root = self.tree.common_ancestor(*root)
+                new_root = self.tree.common_ancestor(root)
             elif root in self._leaves_lookup:
                 new_root = self._leaves_lookup[root]
             elif root=='oldest':


=====================================
treetime/utils.py
=====================================
@@ -238,7 +238,7 @@ def parse_dates(date_file, name_col=None, date_col=None):
 
     try:
         # read the metadata file into pandas dataframe.
-        df = pd.read_csv(date_file, sep=full_sep, engine='python', dtype='str')
+        df = pd.read_csv(date_file, sep=full_sep, engine='python', dtype='str', index_col=False)
         # check the metadata has strain names in the first column
         # look for the column containing sampling dates
         # We assume that the dates might be given either in human-readable format
@@ -293,7 +293,10 @@ def parse_dates(date_file, name_col=None, date_col=None):
                 k = row.loc[index_col]
                 # try parsing as a float first
                 try:
-                    dates[k] = float(date_str)
+                    if date_str:
+                        dates[k] = float(date_str)
+                    else:
+                        dates[k] = None
                     continue
                 except ValueError:
                     # try whether the date string can be parsed as [2002.2:2004.3]
@@ -318,7 +321,6 @@ def parse_dates(date_file, name_col=None, date_col=None):
 
         if all(v is None for v in dates.values()):
             raise TreeTimeError("ERROR: Cannot parse dates correctly! Check date format.")
-        print(dates)
         return dates
     except TreeTimeError as err:
         raise err


=====================================
treetime/wrappers.py
=====================================
@@ -13,7 +13,7 @@ from treetime import config as ttconf
 from treetime import utils
 from .vcf_utils import read_vcf, write_vcf
 from .seq_utils import alphabets
-from treetime import TreeTimeError
+from treetime import TreeTimeError, MissingDataError
 
 def assure_tree(params, tmp_dir='treetime_tmp'):
     """
@@ -30,7 +30,8 @@ def assure_tree(params, tmp_dir='treetime_tmp'):
 
     try:
         tt = TreeAnc(params.tree)
-    except:
+    except (ValueError, TreeTimeError, MissingDataError) as e:
+        print(e)
         print("Tree loading/building failed.")
         return 1
     return 0
@@ -711,42 +712,72 @@ def reconstruct_discrete_traits(tree, traits, missing_data='?', pc=1.0, sampling
     TreeTimeError
         raise error if ancestral reconstruction errors out
     """
-    unique_states = sorted(set(traits.values()))
-    nc = len(unique_states)
-    if nc>180:
-        print("mugration: can't have more than 180 states!", file=sys.stderr)
-        return None, None, None
-    elif nc<2:
-        print("mugration: only one or zero states found -- this doesn't make any sense", file=sys.stderr)
-        return None, None, None
-
     ###########################################################################
     ### make a single character alphabet that maps to discrete states
     ###########################################################################
-    alphabet = [chr(65+i) for i,state in enumerate(unique_states)]
-    missing_char = chr(65+nc)
-    letter_to_state = {a:unique_states[i] for i,a in enumerate(alphabet)}
+
+    unique_states = set(traits.values())
+    n_observed_states = len(unique_states)
+
+    # load weights from file and convert to dict if supplied as string
+    if type(weights)==str:
+        try:
+            tmp_weights = pd.read_csv(weights, sep='\t' if weights[-3:]=='tsv' else ',',
+                                 skipinitialspace=True)
+            weight_dict = {row[0]:row[1] for ri,row in tmp_weights.iterrows() if not np.isnan(row[1])}
+        except:
+            raise ValueError("Loading of weights file '%s' failed!"%weights)
+    elif type(weights)==dict:
+        weight_dict = weights
+    else:
+        weight_dict = None
+
+    # add weights to unique states for alphabet construction
+    if weight_dict is not None:
+        unique_states.update(weight_dict.keys())
+        missing_weights = [c for c in unique_states if c not in weight_dict and c is not missing_data]
+        if len(missing_weights):
+            print("Missing weights for values: " + ", ".join(missing_weights))
+
+        if len(missing_weights)>0.5*n_observed_states:
+            print("More than half of discrete states missing from the weights file")
+            print("Weights read from file are:", weights)
+            raise TreeTimeError("More than half of discrete states missing from the weights file")
+
+    unique_states=sorted(unique_states)
+    # make a map from states (excluding missing data) to characters in the alphabet
+    # note that gap character '-' is chr(45) and will never be included here
+    reverse_alphabet = {state:chr(65+i) for i,state in enumerate(unique_states) if state!=missing_data}
+    alphabet = list(reverse_alphabet.values())
+    # construct a look up from alphabet character to states
+    letter_to_state = {v:k for k,v in reverse_alphabet.items()}
+
+    # construct the vector with weights to be used as equilibrium frequency
+    if weight_dict is not None:
+        mean_weight = np.mean(list(weight_dict.values()))
+        weights = np.array([weight_dict[letter_to_state[c]] if letter_to_state[c] in weight_dict else mean_weight
+                            for c in alphabet], dtype=float)
+        weights/=weights.sum()
+
+    # consistency checks
+    if len(alphabet)<2:
+        print("mugration: only one or zero states found -- this doesn't make any sense", file=sys.stderr)
+        return None, None, None
+
+    n_states = len(alphabet)
+    missing_char = chr(65+n_states)
+    reverse_alphabet[missing_data]=missing_char
     letter_to_state[missing_char]=missing_data
-    reverse_alphabet = {v:k for k,v in letter_to_state.items()}
 
     ###########################################################################
     ### construct gtr model
     ###########################################################################
-    if type(weights)==str:
-        tmp_weights = pd.read_csv(weights, sep='\t' if weights[-3:]=='tsv' else ',',
-                             skipinitialspace=True)
-        weights = {row[0]:row[1] for ri,row in tmp_weights.iterrows()}
-        mean_weight = np.mean(list(weights.values()))
-        weights = np.array([weights[c] if c in weights else mean_weight for c in unique_states], dtype=float)
-        weights/=weights.sum()
-    else:
-        weights = None
 
     # set up dummy matrix
-    W = np.ones((nc,nc), dtype=float)
+    W = np.ones((n_states,n_states), dtype=float)
 
     mugration_GTR = GTR.custom(pi = weights, W=W, alphabet = np.array(alphabet))
-    mugration_GTR.profile_map[missing_char] = np.ones(nc)
+    mugration_GTR.profile_map[missing_char] = np.ones(n_states)
     mugration_GTR.ambiguous=missing_char
 
 



View it on GitLab: https://salsa.debian.org/med-team/python-treetime/-/compare/99b01c64180c229e8644139737f91aa09f0615ab...3888fb134beaebc8e17c8287b7fa406e7fb277d6

-- 
View it on GitLab: https://salsa.debian.org/med-team/python-treetime/-/compare/99b01c64180c229e8644139737f91aa09f0615ab...3888fb134beaebc8e17c8287b7fa406e7fb277d6
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20200503/5504b1df/attachment-0001.html>