[med-svn] [Git][med-team/tnseq-transit][master] 6 commits: Fix watchfile to detect new versions on github

Tue Oct 12 15:30:22 BST 2021


Andreas Tille pushed to branch master at Debian Med / tnseq-transit


Commits:
2ff20610 by Andreas Tille at 2021-10-12T16:15:09+02:00
Fix watchfile to detect new versions on github

- - - - -
2d811ab5 by Andreas Tille at 2021-10-12T16:15:40+02:00
routine-update: New upstream version

- - - - -
bd6c5a95 by Andreas Tille at 2021-10-12T16:15:41+02:00
New upstream version 3.2.2
- - - - -
7f1f89a1 by Andreas Tille at 2021-10-12T16:18:43+02:00
Update upstream source from tag 'upstream/3.2.2'

Update to upstream version '3.2.2'
with Debian dir 394017818186cb57bcd30b5cf259cc4b0dd6e7e4
- - - - -
611941a1 by Andreas Tille at 2021-10-12T16:18:43+02:00
routine-update: Standards-Version: 4.6.0

- - - - -
ee0f9a77 by Andreas Tille at 2021-10-12T16:29:34+02:00
routine-update: Ready to upload to unstable

- - - - -


17 changed files:

- CHANGELOG.md
- debian/changelog
- debian/control
- debian/watch
- setup.py
- src/pytransit/__init__.py
- src/pytransit/__main__.py
- src/pytransit/analysis/anova.py
- src/pytransit/analysis/gi.py
- src/pytransit/analysis/heatmap.py
- src/pytransit/analysis/pathway_enrichment.py
- src/pytransit/analysis/tn5gaps.py
- src/pytransit/analysis/zinb.py
- src/pytransit/convert/gff_to_prot_table.py
- src/pytransit/doc/source/conf.py
- src/pytransit/doc/source/index.rst
- src/pytransit/doc/source/transit_methods.rst


Changes:

=====================================
CHANGELOG.md
=====================================
@@ -2,6 +2,18 @@
 All notable changes to this project will be documented in this file.
 
 
+
+## Version 3.2.2 2022-09-08
+#### TRANSIT:
+ - fixed bug in converting gff_to_prot_table
+ - fixed bug in tn5gaps (fixes some false negative calls)
+ - fixed some bugs in pathway_enrichment (GSEA calculations)
+ - fixed links to Salmonella Tn5 data in docs
+ - fixed problem with margins in heatmap.py that was causing R to fail 
+ - added --ref to anova.py and zinb.py (for computing LFCs relative to designated reference condition)
+ - added --low_mean_filter for heatmap.py (for excluding genes with low counts, even if they are significant by ANOVA or ZINB)
+ - add dependency on pypubsub<4.0
+	
 ## Version 3.2.1 2020-12-22
 #### TRANSIT:
  - maintenance release


=====================================
debian/changelog
=====================================
@@ -1,3 +1,11 @@
+tnseq-transit (3.2.2-1) unstable; urgency=medium
+
+  * Fix watchfile to detect new versions on github
+  * New upstream version
+  * Standards-Version: 4.6.0 (routine-update)
+
+ -- Andreas Tille <tille at debian.org>  Tue, 12 Oct 2021 16:19:02 +0200
+
 tnseq-transit (3.2.1-2) unstable; urgency=medium
 
   * d/p/fix_problematic_comparison.patch (Closes: #984958)


=====================================
debian/control
=====================================
@@ -14,7 +14,7 @@ Build-Depends: debhelper-compat (= 13),
                python3-statsmodels,
                python3-pubsub,
                bwa
-Standards-Version: 4.5.1
+Standards-Version: 4.6.0
 Vcs-Browser: https://salsa.debian.org/med-team/tnseq-transit
 Vcs-Git: https://salsa.debian.org/med-team/tnseq-transit.git
 Homepage: http://saclab.tamu.edu/essentiality/transit/


=====================================
debian/watch
=====================================
@@ -1,3 +1,3 @@
 version=4
 
-https://github.com/mad-lab/transit/releases .*/archive/v(\d[\d.-]+)\.(?:tar(?:\.gz|\.bz2)?|tgz)
+https://github.com/mad-lab/transit/releases .*/v(\d[\d.-]+)\.(?:tar(?:\.gz|\.bz2)?|tgz)


=====================================
setup.py
=====================================
@@ -148,7 +148,9 @@ setup(
     # your project is installed. For an analysis of "install_requires" vs pip's
     # requirements files see:
     # https://packaging.python.org/en/latest/requirements.html
-    install_requires=['setuptools', 'numpy~=1.16', 'scipy~=1.2', 'matplotlib~=3.0', 'pillow~=6.0', 'statsmodels~=0.9'],
+    # 'pypubsub<4.0' and 'wxPython' are needed for GUI only, but go ahead and install them
+    # the reason for restriction on pypubsub is that version>=4.0 does not work with python2 - I can probably get rid of this restriction, since everybody must be using python3 by now
+    install_requires=['setuptools', 'numpy~=1.16', 'scipy~=1.2', 'matplotlib~=3.0', 'pillow~=6.0', 'statsmodels~=0.9', 'pypubsub<4.0', 'wxPython'],
 
     #dependency_links = [
     #	"git+https://github.com/wxWidgets/wxPython.git#egg=wxPython"


=====================================
src/pytransit/__init__.py
=====================================
@@ -2,6 +2,6 @@
 __all__ = ["transit_tools", "tnseq_tools", "norm_tools", "stat_tools"]
 
 
-__version__ = "v3.2.1"
+__version__ = "v3.2.2"
 prefix = "[TRANSIT]"
 


=====================================
src/pytransit/__main__.py
=====================================
@@ -91,11 +91,9 @@ def main(*args, **kwargs):
 
     # Tried GUI mode but has no wxPython
     elif not (args or kwargs) and not hasWx:
-        print("Please install wxPython to run in GUI Mode.")
-        print("To run in Console Mode please follow these instructions:")
+        print("Please install wxPython to run in GUI Mode. (pip install wxPython)")
         print("")
-        print("Usage: python %s <method>" % sys.argv[0])
-        print("List of known methods:")
+        print("To run in Console Mode, try 'transit <method>' with one of the following methods:")
         for m in methods:
             print("\t - %s" % m)
     # Running in Console mode


=====================================
src/pytransit/analysis/anova.py
=====================================
@@ -34,10 +34,11 @@ class AnovaMethod(base.MultiConditionMethod):
     """
     anova
     """
-    def __init__(self, combined_wig, metadata, annotation, normalization, output_file, ignored_conditions=[], included_conditions=[], nterm=0.0, cterm=0.0, PC=1):
+    def __init__(self, combined_wig, metadata, annotation, normalization, output_file, ignored_conditions=[], included_conditions=[], nterm=0.0, cterm=0.0, PC=1, refs=[]):
         base.MultiConditionMethod.__init__(self, short_name, long_name, short_desc, long_desc, combined_wig, metadata, annotation, output_file,
                 normalization=normalization, ignored_conditions=ignored_conditions, included_conditions=included_conditions, nterm=nterm, cterm=cterm)
         self.PC = PC
+        self.refs = refs
 
     @classmethod
     def transit_error(self,msg): print("error: %s" % msg) # for some reason, transit_error() in base class or transit_tools doesn't work right; needs @classmethod
@@ -58,18 +59,20 @@ class AnovaMethod(base.MultiConditionMethod):
         NTerminus = float(kwargs.get("iN", 0.0))
         CTerminus = float(kwargs.get("iC", 0.0))
         PC = int(kwargs.get("PC", 5))
+        refs = kwargs.get("-ref",[]) # list of condition names to use a reference for calculating LFCs
+        if refs!=[]: refs = refs.split(',')
         ignored_conditions = list(filter(None, kwargs.get("-ignore-conditions", "").split(",")))
         included_conditions = list(filter(None, kwargs.get("-include-conditions", "").split(",")))
 
         # check for unrecognized flags
-        flags = "-n --ignore-conditions --include-conditions -iN -iC -PC".split()
+        flags = "-n --ignore-conditions --include-conditions -iN -iC -PC --ref".split()
         for arg in rawargs:
           if arg[0]=='-' and arg not in flags:
             self.transit_error("flag unrecognized: %s" % arg)
             print(AnovaMethod.usage_string())
             sys.exit(0)
 
-        return self(combined_wig, metadata, annotation, normalization, output_file, ignored_conditions, included_conditions, NTerminus, CTerminus, PC)
+        return self(combined_wig, metadata, annotation, normalization, output_file, ignored_conditions, included_conditions, NTerminus, CTerminus, PC, refs)
 
     def wigs_to_conditions(self, conditionsByFile, filenamesInCombWig):
         """
@@ -171,8 +174,9 @@ class AnovaMethod(base.MultiConditionMethod):
             p[rv],q[rv],statusMap[rv] = pvals[i],qvals[i],status[i]
         return (p, q, statusMap)
 
-    def calcLFCs(self,means,PC=1):
-      grandmean = numpy.mean(means)
+    def calcLFCs(self,means,refs=[],PC=1):
+      if len(refs)==0: refs = means # if ref condition(s) not explicitly defined, use mean of all
+      grandmean = numpy.mean(refs)
       lfcs = [math.log((x+PC)/float(grandmean+PC),2) for x in means]
       return lfcs
 
@@ -217,7 +221,8 @@ class AnovaMethod(base.MultiConditionMethod):
             Rv = gene["rv"]
             if Rv in MeansByRv:
               means = [MeansByRv[Rv][c] for c in conditionsList]
-              LFCs = self.calcLFCs(means,self.PC)
+              refs = [MeansByRv[Rv][c] for c in self.refs]
+              LFCs = self.calcLFCs(means,refs,self.PC)
               vals = ([Rv, gene["gene"], str(len(RvSiteindexesMap[Rv]))] +
                       ["%0.2f" % x for x in means] + 
                       ["%0.3f" % x for x in LFCs] + 
@@ -234,6 +239,7 @@ class AnovaMethod(base.MultiConditionMethod):
   -n <string>         :=  Normalization method. Default: -n TTR
   --include-conditions <cond1,...> := Comma-separated list of conditions to use for analysis (Default: all)
   --ignore-conditions <cond1,...> := Comma-separated list of conditions to ignore (Default: none)
+  --ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if multiple conditions)
   -iN <N> :=  Ignore TAs within given percentage (e.g. 5) of N terminus. Default: -iN 0
   -iC <N> :=  Ignore TAs within given percentage (e.g. 5) of C terminus. Default: -iC 0
   -PC <N> := pseudocounts to use for calculating LFC. Default: -PC 5"""


=====================================
src/pytransit/analysis/gi.py
=====================================
@@ -824,7 +824,7 @@ class GIMethod(base.QuadConditionMethod):
         data.sort(key=lambda x: x[probcol]) 
         sortedprobs = numpy.array([x[probcol] for x in data]) 
 
-        # BFDR method: Newton M.A., Noueiry A., Sarkar D., Ahlquist P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5:155–176.
+        # BFDR method: Newton et al (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method.  Biostatistics, 5:155-176.
 
         if self.signif=="BFDR":
             sortedprobs = numpy.array(sortedprobs) 


=====================================
src/pytransit/analysis/heatmap.py
=====================================
@@ -112,6 +112,7 @@ class HeatmapMethod(base.SingleConditionMethod):
         self.outfile = args[1]
         self.qval = float(kwargs.get("qval",0.05))
         self.topk = int(kwargs.get("topk",-1))
+        self.low_mean_filter = int(kwargs.get("low_mean_filter",5)) # filter out genes with grandmean<5 by default
         return self(self.infile,outfile=self.outfile)
 
     def Run(self):
@@ -136,14 +137,18 @@ class HeatmapMethod(base.SingleConditionMethod):
             headers = headers[3:3+n]
             headers = [x.replace("Mean_","") for x in headers]
           else:
-            lfcs = [float(x) for x in w[3+n:3+n+n]] # take just the columns of means
+            means = [float(x) for x in w[3:3+n]] # take just the columns of means
+            lfcs = [float(x) for x in w[3+n:3+n+n]] # take just the columns of LFCs
             qval = float(w[-2])
-            data.append((w,lfcs,qval))
+            data.append((w,means,lfcs,qval))
 
         data.sort(key=lambda x: x[-1])
         hits,LFCs = [],[]
-        for k,(w,lfcs,qval) in enumerate(data):
-          if (self.topk==-1 and qval<self.qval) or (self.topk!=-1 and k<self.topk): hits.append(w); LFCs.append(lfcs)
+        for k,(w,means,lfcs,qval) in enumerate(data):
+          if (self.topk==-1 and qval<self.qval) or (self.topk!=-1 and k<self.topk): 
+            mm = round(numpy.mean(means),1)
+            if mm<self.low_mean_filter: print("excluding %s/%s, mean(means)=%s" % (w[0],w[1],mm))
+            else: hits.append(w); LFCs.append(lfcs)
 
         print("heatmap based on %s genes" % len(hits))
         genenames = ["%s/%s" % (w[0],w[1]) for w in hits]
@@ -168,8 +173,9 @@ H = 300+R*15
 
 png(outfilename,width=W,height=H)
 #defaults are lwid=lhei=c(1.5,4)
-heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),lwid=c(2,6),lhei=c(0.1,2),trace="none",cexCol=1.4,cexRow=1.4,key=T) # make sure white=0
-heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),lwid=c(2,6),lhei=c(0.1,2),trace="none",cexCol=1.4,cexRow=1.4,key=T) # make sure white=0
+#heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),lwid=c(2,6),lhei=c(0.1,2),trace="none",cexCol=1.4,cexRow=1.4,key=T) # make sure white=0
+#heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),trace="none",cexCol=1.2,cexRow=1.2,key=T) # make sure white=0 # setting margins was causing failures, so remove it 8/22/21
+heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),trace="none",cexCol=1.2,cexRow=1.2,key=T) # actually, margins was OK, so the problem must have been with lhei and lwid
 dev.off()
 }
       ''')
@@ -177,7 +183,7 @@ dev.off()
 
     @classmethod
     def usage_string(self):
-        return "usage: python3 %s heatmap <anova_or_zinb_output> <heatmap.png> -anova|-zinb [-topk <int>] [-qval <float>]\n note: genes are selected based on qval<0.05 by default" % sys.argv[0]
+        return "usage: python3 %s heatmap <anova_or_zinb_output> <heatmap.png> -anova|-zinb [-topk <int>] [-qval <float>] [-low_mean_filter <int>]\n note: genes are selected based on qval<0.05 by default" % sys.argv[0]
 
 
 if __name__ == "__main__":


=====================================
src/pytransit/analysis/pathway_enrichment.py
=====================================
@@ -160,11 +160,14 @@ Optional parameters:
 
   # assume these are listed as pairs (tab-sep)
   # return bidirectional hash (genes->[terms], terms->[genes]; each can be one-to-many, hence lists)
-  def read_associations(self,filename):
+  # filter could be a subset of genes we want to focus on (throw out the rest)
+
+  def read_associations(self,filename,filter=None):
     associations = {}
     for line in open(filename):
       if line[0]=='#': continue
       w = line.rstrip().split('\t')
+      if filter!=None and w[0] not in filter: continue # skip genes in association file that are not relevant (i.e. not in resampling file)
       # store mappings in both directions
       for (a,b) in [(w[0],w[1]),(w[1],w[0])]:
         if a not in associations: associations[a] = []
@@ -218,8 +221,11 @@ Optional parameters:
     
   def GSEA(self):
     data,hits,headers = self.read_resampling_file(self.resamplingFile) # hits are not used in GSEA()
+    orfs_in_resampling_file = [w[0] for w in data]
     headers = headers[-1].rstrip().split('\t')
-    associations = self.read_associations(self.associationsFile)
+    associations = self.read_associations(self.associationsFile,filter=orfs_in_resampling_file) # bidirectional map; includes term->genelist and gene->termlist
+    # filter: project associations (of orfs to pathways) onto only those orfs appearing in the resampling file
+
     ontology = self.read_pathways(self.pathwaysFile)
     genenames = {}
     for gene in data: genenames[gene[0]] = gene[1]
@@ -235,14 +241,18 @@ Optional parameters:
     # rank by SLPV=sign(LFC)*log10(pval)
     # note: genes with lowest p-val AND negative LFC have highest scores (like positive correlation)
     # there could be lots of ties with pval=0 or 1, but that's OK
-    LFC_col = headers.index("log2FC")
-    Pval_col = headers.index("p-value")
     pairs = [] # pair are: rv and score (SLPV)
     for w in data:
-      orf,LFC,Pval = w[0],float(w[LFC_col]),float(w[Pval_col])
-      SLPV = (-1 if LFC<0 else 1)*math.log(Pval+0.000001,10)
-      if self.ranking=="SLPV": pairs.append((orf,SLPV))
-      elif self.ranking=="LFC": pairs.append((orf,LFC))
+      orf = w[0]
+      if self.ranking=="SLPV": 
+        Pval_col = headers.index("p-value")
+        Pval = float(w[Pval_col])
+        SLPV = (-1 if LFC<0 else 1)*math.log(Pval+0.000001,10)
+        pairs.append((orf,SLPV))
+      elif self.ranking=="LFC": 
+        LFC_col = headers.index("log2FC")
+        LFC = float(w[LFC_col])
+        pairs.append((orf,LFC))
 
     # pre-randomize ORFs, to avoid genome-position bias in case of ties in pvals (e.g. 1.0)
     indexes = range(len(pairs))
@@ -260,23 +270,25 @@ Optional parameters:
     for i,term in enumerate(terms):
       sys.stdout.flush()
       orfs = terms2orfs.get(term,[])
-      if len(orfs)<=1: continue
+      num_genes_in_pathway = len(orfs)
+      if num_genes_in_pathway<2: continue # skip pathways with less than 2 genes
       mr = self.mean_rank(orfs,orfs2rank)
       es = self.enrichment_score(orfs,orfs2rank,orfs2score,p=self.p) # always positive, even if negative deviation, since I take abs
       higher = 0
       for n in range(Nperm):
-        perm = random.sample(allgenes,len(orfs)) # compare to ES for random sets of genes of same size
-        if self.enrichment_score(perm,orfs2rank,orfs2score,p=self.p)>es: higher += 1
-        if n>100 and higher>10: break # adaptive
+        perm = random.sample(allgenes,num_genes_in_pathway) # compare to enrichment score for random sets of genes of same size
+        e2 = self.enrichment_score(perm,orfs2rank,orfs2score,p=self.p)
+        if e2>es: higher += 1
+        if n>100 and higher>10: break # adaptive: can stop after seeing 10 events (permutations with higher ES)
       pval = higher/float(n)
-      vals = ['#',term,len(orfs),mr,es,pval,ontology.get(term,"?")]
+      vals = ['#',term,num_genes_in_pathway,mr,es,pval,ontology.get(term,"?")]
       #sys.stderr.write(' '.join([str(x) for x in vals])+'\n')
       pctg=(100.0*i)/Total
       text = "Running Pathway Enrichment Method... %5.1f%%" % (pctg)
       self.progress_update(text, i)      
       results.append((term,mr,es,pval))
     
-    results.sort(key=lambda x: x[1])
+    results.sort(key=lambda x: x[1]) # sort on mean rank
     pvals = [x[-1] for x in results]
     rej,qvals = multitest.fdrcorrection(pvals)
     results = [tuple(list(res)+[q]) for res,q in zip(results,qvals)]


=====================================
src/pytransit/analysis/tn5gaps.py
=====================================
@@ -234,7 +234,7 @@ class Tn5GapsMethod(base.SingleConditionMethod):
         self.transit_message("Starting Tn5 gaps method")
         start_time = time.time()
         
-        self.transit_message("Getting data (May take a while)")
+        self.transit_message("Loading data (May take a while)")
         
         # Combine all wigs
         (data,position) = transit_tools.get_validated_data(self.ctrldata, wxobj=self.wxobj)
@@ -256,7 +256,7 @@ class Tn5GapsMethod(base.SingleConditionMethod):
         exp_cutoff = exprunmax + 2*stddevrun
 
         # Get the runs
-        self.transit_message("Getting non-insertion runs in genome")
+        self.transit_message("Identifying non-insertion runs in genome")
         run_arr = tnseq_tools.runs_w_info(counts)
         pos_hash = transit_tools.get_pos_hash(self.annotation_path)
 
@@ -275,13 +275,13 @@ class Tn5GapsMethod(base.SingleConditionMethod):
             count += 1
             genes = tnseq_tools.get_genes_in_range(pos_hash, run['start'], run['end'])
             for gene_orf in genes:
+                gene = genes_obj[gene_orf] # bug fix: moved this up
                 start,end = gene.start,gene.end
                 a,b = self.NTerminus,self.CTerminus
                 if gene.strand=="-": a,b = b,a
                 start = start+int((end-start)*(a/100.))
                 end = end-int((end-start)*(b/100.))
 
-                gene = genes_obj[gene_orf]
                 inter_sz = self.intersect_size([run['start'], run['end']], [start,end]) + 1
                 percent_overlap = self.calc_overlap([run['start'], run['end']], [start,end])
                 run_len = run['length']
@@ -296,9 +296,10 @@ class Tn5GapsMethod(base.SingleConditionMethod):
                     results_per_gene[gene.orf] = [gene.orf, gene.name, gene.desc, gene.k, gene.n, gene.r, inter_sz, run_len, pval]
             
             # Update Progress
-            text = "Running Tn5Gaps method... %1.1f%%" % (100.0*count/N) 
-            self.progress_update(text, count)
-                
+            if count%10000==0: 
+              text = "Running Tn5Gaps method... %1.1f%%" % (100.0*count/N) 
+              self.progress_update(text, count)
+
         data = list(results_per_gene.values())
         exp_run_len = float(accum)/N
         
@@ -331,9 +332,10 @@ class Tn5GapsMethod(base.SingleConditionMethod):
         self.output.write("#Essential gene count: %d\n" % (sig_genes_count))
         self.output.write("#Minimum reads: %d\n" % (self.minread))
         self.output.write("#Replicate combination method: %s\n" % (self.replicates))
+        self.output.write("#Insertion density: %0.3f\n" % (pins))
+        self.output.write("#Mean run length: %0.1f\n" % (exp_run_len))
+        self.output.write("#Expected max run length: %0.1f\n" % (exprunmax))
         self.output.write("#Minimum significant run length: %d\n" % (min_sig_len))
-        self.output.write("#Expected run length: %1.5f\n" % (exp_run_len))
-        self.output.write("#Expected max run length: %s\n" % (exprunmax))
         self.output.write("#%s\n" % "\t".join(columns))
         #self.output.write("#Orf\tName\tDesc\tk\tn\tr\tovr\tlenovr\tpval\tpadj\tcall\n")
 


=====================================
src/pytransit/analysis/zinb.py
=====================================
@@ -48,7 +48,7 @@ class ZinbMethod(base.MultiConditionMethod):
     """
     Zinb
     """
-    def __init__(self, combined_wig, metadata, annotation, normalization, output_file, ignored_conditions=[], included_conditions=[], winz=False, nterm=5.0, cterm=5.0, condition="Condition", covars=[], interactions = [], PC=1):
+    def __init__(self, combined_wig, metadata, annotation, normalization, output_file, ignored_conditions=[], included_conditions=[], winz=False, nterm=5.0, cterm=5.0, condition="Condition", covars=[], interactions = [], PC=1, refs=[]):
         base.MultiConditionMethod.__init__(self, short_name, long_name, short_desc, long_desc, combined_wig, metadata, annotation, output_file,
                 normalization=normalization, ignored_conditions=ignored_conditions, included_conditions=included_conditions, nterm=nterm, cterm=cterm)
         self.winz = winz
@@ -56,6 +56,7 @@ class ZinbMethod(base.MultiConditionMethod):
         self.interactions = interactions
         self.condition = condition
         self.PC = PC
+        self.refs = refs
 
     @classmethod
     def transit_error(self,msg): print("error: %s" % msg) # for some reason, transit_error() in base class or transit_tools doesn't work right; needs @classmethod
@@ -92,19 +93,21 @@ class ZinbMethod(base.MultiConditionMethod):
         condition = kwargs.get("-condition", "Condition")
         covars = list(filter(None, kwargs.get("-covars", "").split(",")))
         interactions = list(filter(None, kwargs.get("-interactions", "").split(",")))
+        refs = kwargs.get("-ref",[]) # list of condition names to use a reference for calculating LFCs
+        if refs!=[]: refs = refs.split(',')
         winz = True if "w" in kwargs else False
         ignored_conditions = list(filter(None, kwargs.get("-ignore-conditions", "").split(",")))
         included_conditions = list(filter(None, kwargs.get("-include-conditions", "").split(",")))
 
         # check for unrecognized flags
-        flags = "-n --ignore-conditions --include-conditions -iN -iC -PC --condition --covars --interactions --gene".split()
+        flags = "-n --ignore-conditions --include-conditions -iN -iC -PC --condition --covars --interactions --gene --ref".split()
         for arg in rawargs:
           if arg[0]=='-' and arg not in flags:
             self.transit_error("flag unrecognized: %s" % arg)
             print(ZinbMethod.usage_string())
             sys.exit(0)
 
-        return self(combined_wig, metadata, annotation, normalization, output_file, ignored_conditions, included_conditions, winz, NTerminus, CTerminus, condition, covars, interactions, PC)
+        return self(combined_wig, metadata, annotation, normalization, output_file, ignored_conditions, included_conditions, winz, NTerminus, CTerminus, condition, covars, interactions, PC, refs)
 
     def wigs_to_conditions(self, conditionsByFile, filenamesInCombWig):
         """
@@ -615,9 +618,10 @@ class ZinbMethod(base.MultiConditionMethod):
             Rv = gene["rv"]
             means = [statsByRv[Rv]['mean'][group] for group in orderedStatGroupNames]
             PC = self.PC
-            if len(means)==2: LFCs = [numpy.math.log((means[1]+PC)/(means[0]+PC),2)]
+            if len(means)==2: LFCs = [numpy.math.log((means[1]+PC)/(means[0]+PC),2)] # still need to adapt this to use --ref if defined
             else: 
-              m = numpy.mean(means)
+              if len(self.refs)==0: m = numpy.mean(means) # grand mean across all conditions
+              else: m = numpy.mean([statsByRv[Rv]['mean'][group] for group in self.refs]) ###TRI
               LFCs = [numpy.math.log((x+PC)/(m+PC),2) for x in means]
             vals = ([Rv, gene["gene"], str(len(RvSiteindexesMap[Rv]))] +
                     ["%0.1f" % statsByRv[Rv]['mean'][group] for group in orderedStatGroupNames] +
@@ -638,6 +642,7 @@ class ZinbMethod(base.MultiConditionMethod):
         -n <string>         :=  Normalization method. Default: -n TTR
         --ignore-conditions <cond1,cond2> :=  Comma separated list of conditions to ignore, for the analysis.
         --include-conditions <cond1,cond2> :=  Comma separated list of conditions to include, for the analysis. Conditions not in this list, will be ignored.
+        --ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if multiple conditions)
         -iN <float>     :=  Ignore TAs occuring within given percentage (as integer) of the N terminus. Default: -iN 5
         -iC <float>     :=  Ignore TAs occuring within given percentage (as integer) of the C terminus. Default: -iC 5
         -PC <N>         := pseudocounts to use for calculating LFCs. Default: -PC 5


=====================================
src/pytransit/convert/gff_to_prot_table.py
=====================================
@@ -143,7 +143,7 @@ class GffProtMethod(base.ConvertMethod):
 
         for i, line in enumerate(lines):
             line = line.strip()
-            if line.startswith('#'): continue
+            if len(line)==0 or line.startswith('#'): continue
             cols = line.split('\t')
             if (len(cols) < 9): 
                 sys.stderr.write(("Ignoring invalid row with entries: {0}\n".format(cols)))
@@ -160,6 +160,7 @@ class GffProtMethod(base.ConvertMethod):
                     labels[k.strip()] = v.strip()
                 Rv = labels["locus_tag"].strip() # error out if not found
                 gene = labels.get('gene', '') # or Name?
+                if gene=="": gene = '-'
                 desc = labels.get('product', '') 
                 vals = [desc, start, end, strand, size, '-', '-', gene, Rv, '-']
                 writer.writerow(vals)


=====================================
src/pytransit/doc/source/conf.py
=====================================
@@ -34,7 +34,7 @@ extensions = [
     'sphinx.ext.todo',
     'sphinx.ext.viewcode',
     'sphinx.ext.napoleon',
-    'sphinx.ext.mathbase',
+    #'sphinx.ext.mathbase',
     'sphinx.ext.mathjax',
 ]
 
@@ -144,6 +144,15 @@ html_theme = 'sphinx_rtd_theme'
 # documentation.
 #html_theme_options = {}
 
+#html_theme_options = {
+#    'collapse_navigation': False,
+#    'sticky_navigation': True,
+#    'navigation_depth': 4,
+#    'includehidden': True,
+#    'titles_only': False
+#}
+
+
 # Add any paths that contain custom themes here, relative to this directory.
 #html_theme_path = []
 
@@ -384,5 +393,6 @@ epub_exclude_files = ['search.html']
 # If false, no index is generated.
 #epub_use_index = True
 def setup(app):
-    app.add_stylesheet('css/default.css')
+    #app.add_stylesheet('css/default.css')
+    app.add_css_file('css/default.css')
 


=====================================
src/pytransit/doc/source/index.rst
=====================================
@@ -8,7 +8,13 @@ Welcome to TRANSIT's documentation!
     :target: https://github.com/mad-lab/transit
     :alt: GitHub last tag
 
-This page contains the documentation for TRANSIT. Below are a few quick links to some of the most important sections of the documentation, followed by a brief overview of TRANSIT's features.
+Transit is python-based software for analyzing TnSeq data
+(sequencing data from transposon mutant libraries)
+to determine essentiality of bacterial genes under different conditions.
+
+This page contains the documentation for TRANSIT. Below are a few
+quick links to some of the most important sections of the
+documentation, followed by a brief overview of TRANSIT's features.
 
 Quick Links
 ~~~~~~~~~~~
@@ -20,7 +26,7 @@ Quick Links
 * :ref:`tutorial-link`
 * :ref:`tpp-link`
 * :ref:`code-link`
-
+* `PDF manual with overview of analysis methods in Transit <https://orca1.tamu.edu/essentiality/transit/transit-manual.pdf>`_
 
 Features
 ~~~~~~~~
@@ -57,8 +63,8 @@ TRANSIT offers a variety of features including:
 .. _manual-link:
 
 .. toctree::
-   :maxdepth: 2
-   :caption: TRANSIT Manual
+   :maxdepth: 3
+   :caption: TRANSIT MANUAL
 
    transit_overview
    transit_install
@@ -70,7 +76,7 @@ TRANSIT offers a variety of features including:
 .. _tutorial-link:
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 3
    :caption: TRANSIT Tutorials
 
    transit_essentiality_tutorial
@@ -83,7 +89,7 @@ TRANSIT offers a variety of features including:
 .. _tpp-link:
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 3
    :caption: TPP Manual
 
    tpp.rst
@@ -91,7 +97,7 @@ TRANSIT offers a variety of features including:
 .. _code-link:
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 3
    :caption: Code Documentation
 
    transit


=====================================
src/pytransit/doc/source/transit_methods.rst
=====================================
@@ -2,12 +2,16 @@
 .. _`analysis_methods`:
 
 ==================
- Analysis Methods
+ Analysis Methods (Expand Me First!)
 ==================
 
 
-TRANSIT has analysis methods capable of analyzing **Himar1** and **Tn5** datasets.
-Below is a description of some of the methods.
+TRANSIT has analysis methods capable of analyzing **Himar1** and
+**Tn5** datasets.  Below is a description of some of the methods.  
+
+The analysis methods in Transit are also described in this `PDF manual
+<https://orca1.tamu.edu/essentiality/transit/transit-manual.pdf>`_ .
+
 
 |
 
@@ -221,7 +225,7 @@ AK. (2009). `Simultaneous assay of every Salmonella Typhi gene using one million
 transposon mutants. <http://www.ncbi.nlm.nih.gov/pubmed/19826075>`_ *Genome Res.* , 19(12):2308-16.
 
 This data was downloaded from SRA (located `here <http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP000051>`_) , and used to make
-wig files (`base <http://orca1.tamu.edu/essentiality/transit/data/salmonella_base.wig>`_ and `bile <http://orca1.tamu.edu/essentiality/transit/data/salmonella_bile.wig>`_) and the following 4 baseline datasets
+wig files (`baseline <http://orca1.tamu.edu/essentiality/transit/data/salmonella_baseline.wig>`_ and `bile <http://orca1.tamu.edu/essentiality/transit/data/salmonella_bile.wig>`_) and the following 4 baseline datasets
 were merged to make a wig file: (IL2_2122_1,3,6,8). Our analysis
 produced 415 genes with adjusted p-values less than 0.05, indicating
 essentiality, and the analysis from the above paper produced 356
@@ -530,7 +534,7 @@ parameters are available for the method:
    calculated will have, at the expense of longer computation time. The
    resampling method runs on 10,000 samples by default.
 
--  **Output Histograms:**\ Determines whether to output .png images of
+-  **Output Histograms:** Determines whether to output .png images of
    the histograms obtained from resampling the difference in
    read-counts.
 
@@ -552,7 +556,7 @@ parameters are available for the method:
    as real differences. See the :ref:`Normalization <normalization>` section for a description
    of normalization method available in TRANSIT.
 
--  **--ctrl_lib, --exp_lib:** These are for doing resampling with datasets from multiple libraries, see below.
+-  **\-\-ctrl_lib, \-\-exp_lib:** These are for doing resampling with datasets from multiple libraries, see below.
 
 -  **-iN, -iC:** Trimming of TA sites near N- and C-terminus.
    The default for trimming TA sites in the termini of ORFs is 0.
@@ -1051,6 +1055,7 @@ Example
         -n <string>         :=  Normalization method. Default: -n TTR
         --ignore-conditions <cond1,...> :=  Comma separated list of conditions to ignore, for the analysis. Default: None
         --include-conditions <cond1,...> :=  Comma separated list of conditions to include, for the analysis. Default: All
+        --ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if multiple conditions)
         -iN <float> :=  Ignore TAs occurring within given percentage (as integer) of the N terminus. Default: -iN 0
         -iC <float> :=  Ignore TAs occurring within given percentage (as integer) of the C terminus. Default: -iC 0
         -PC         := Pseudocounts to use in calculating LFCs. Default: -PC 5
@@ -1083,12 +1088,22 @@ The filenames should match what is shown in the header of the combined_wig (incl
 Parameters
 ----------
 
-The following parameters are available for the method:
+The following parameters are available for the ANOVA method:
 
--  **Ignore Conditions, Include Conditions:** Can use this to drop
-   conditions not of interest or specify a particular subset of conditions to use for ANOVA analysis.
+-  **\-\-include-conditions:** Includes the given set of conditions from the ZINB test. Conditions not in this list are ignored. Note: this is useful for specifying the order in which the columns are listed in the output file.
 
--  **Normalization Method:** Determines which normalization method to
+-  **\-\-ignore-conditions:** Can use this to drop conditions not of interest.
+
+-  **\-\-ref:** Specify which condition to use as a reference for computing LFCs.
+   By default, LFCs for each gene in each condition are calculated with respect
+   to the *grand mean* count across all conditions (so conditions with higher counts will be balanced
+   with conditions with lower counts).  However, if there is a defined reference condition
+   in the data, it may be specified using **\-\-ref** (in which case LFCs for that condition will
+   be around 0, and will be positive or negative for the other conditions, depending on whether
+   counts are higher or lower than the reference condintion.  If there is more than one
+   condition to use as reference (i.e. pooled), they may be given as a comma-separated list.
+
+-  **-n** Normalization Method. Determines which normalization method to
    use when comparing datasets. Proper normalization is important as it
    ensures that other sources of variability are not mistakenly treated
    as real differences. See the :ref:`Normalization <normalization>` section for a description
@@ -1096,6 +1111,8 @@ The following parameters are available for the method:
 
 -  **-PC** Pseudocounts to use in calculating LFCs (see below). Default: -PC 5
 
+
+
 Output and Diagnostics
 ----------------------
 
@@ -1203,6 +1220,7 @@ Example
         -n <string>         :=  Normalization method. Default: -n TTR
         --ignore-conditions <cond1,cond2> :=  Comma separated list of conditions to ignore, for the analysis. Default: None
         --include-conditions <cond1,cond2> :=  Comma separated list of conditions to include, for the analysis. Default: All
+        --ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if more than one)
         -iN <float>     :=  Ignore TAs occuring within given percentage of the N terminus. Default: -iN 5
         -iC <float>     :=  Ignore TAs occuring within given percentage of the C terminus. Default: -iC 5
         -PC <N>         :=  Pseudocounts used in calculating LFCs in output file. Default: -PC 5
@@ -1265,8 +1283,9 @@ Parameters
 
 The following parameters are available for the method:
 
--  **Ignore Conditions:** Ignores the given set of conditions from the ZINB test.
--  **Include Conditions:** Includes the given set of conditions from the ZINB test. Conditions not in this list are ignored.
+-  **\-\-include-conditions:** Includes the given set of conditions from the ZINB test. Conditions not in this list are ignored. Note: this is useful for specifying the order in which the columns are listed in the output file.
+-  **\-\-ignore-conditions:** Ignores the given set of conditions from the ZINB test.
+-  **\-\-ref:** which condition to use as a reference when computing LFCs in the output file
 -  **Normalization Method:** Determines which normalization method to
    use when comparing datasets. Proper normalization is important as it
    ensures that other sources of variability are not mistakenly treated
@@ -1296,7 +1315,7 @@ strains (think: different 'slopes'). In such a case, we would say strain and tim
 
 If covariates distinguishing the samples are available,
 such as batch or library, they may be
-incorporated in the ZINB model by using the **\\-\\-covars** flag and samples
+incorporated in the ZINB model by using the **\-\-covars** flag and samples
 metadata file. For example, consider the following samples metadata
 file, with a column describing the batch information of each
 replicate.
@@ -1311,7 +1330,7 @@ replicate.
   chol2   cholesterol  /Users/example_data/cholesterol_rep3.wig     B2
 
 This information can be included to eliminate variability due to batch by using
-the **\\-\\-covars** flag.
+the **\-\-covars** flag.
 
 ::
 
@@ -1319,7 +1338,7 @@ the **\\-\\-covars** flag.
 
 
 Similarly, an interaction variable may be included in the model.
-This is specified by the user with the **\\-\\-interactions** flag,
+This is specified by the user with the **\-\-interactions** flag,
 followed by the name of a column in the samples metadata to test as the interaction
 with the condition. If there are multiple interactions, they may be given as a comma-separated list.
 
@@ -1336,7 +1355,7 @@ differs depending on the strain, we could do this:
  
 In this case, the condition is implicitly assumed to be the column in the samples metadata file
 labeled 'Condition'.  If you want to specify a different column to use as the primary condition to 
-test (for example, if Treatment were a distinct column), you can use the **\\-\\-condition** flag:
+test (for example, if Treatment were a distinct column), you can use the **\-\-condition** flag:
 
 ::
 
@@ -1495,7 +1514,10 @@ typical threshold for conditional essentiality on is q-value < 0.05.
 **LFCs** (log-fold-changes):
 For each condition, the LFC is calculated as the log-base-2 of the
 ratio of mean insertion count in that condition **relative to the 
-mean of means across all the conditions**.
+mean of means across all the conditions** (by default).
+However, you can change this by desginating a specific reference condition using the flag **\-\-ref**.
+(If there are multiple reference conditions, they may be given as a comma separated list.)
+(If you are using interactions, it is more complicated to specify a reference condition by name because they have to include the interactions, e.g. as shown in the column headers in the output file.)
 Pseudocount are incorporated to reduce the impact of noise on LFCs, based on the formula below.
 The pseudocounts can be adjusted using the -PC flag.
 Changing the pseudocounts (via -PC) can reduce the artifactual appearance of genes with
@@ -1723,7 +1745,26 @@ Parameters
 
     `Grossmann S, Bauer S, Robinson PN, Vingron M. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics. 2007 Nov 15;23(22):3024-31. <https://www.ncbi.nlm.nih.gov/pubmed/17848398>`_
 
+  For the ONT method in pathway_enrichment, the enrichment for a given
+  GO term can be expressed (in a simplified way, leaving out the
+  pseudocounts) as:
+
+::
+
+  enrichment = log (  (b/q) / (m/p)  )
+|
+
+  where:
+
+*    b is the number of genes with this GO term in the subset of hits (e.g. conditional essentials from resampling, with qval<0.05)
+*    q is the number of genes in the subset of hits with a parent of this GO term
+*    m is the total number of genes with this GO term in the genome
+*    p is the number of genes in the genome with a parent of this GO term 
+
+  So enrichment is the log of the ratio of 2 ratios:
 
+  1. the relative abundance of genes with this GO term compared to those with a parent GO term   among the hits
+  2. the relative abundance of genes with this GO term compared to those with a parent GO term   in the whole genome
 
 
 Auxilliary Pathway Files in Transit Data Directory
@@ -1919,7 +1960,7 @@ Usage:
 
 ::
 
-  python3 src/transit.py heatmap <anova_or_zinb_output> <heatmap.png> -anova|-zinb [-topk <int>] [-qval <float]
+  python3 src/transit.py heatmap <anova_or_zinb_output> <heatmap.png> -anova|-zinb [-topk <int>] [-qval <float] [-low_mean_filter <int>]
 
 Note that the first optional argument (flag) is required to be either '-anova' or '-zinb', a flag to
 indicate the type of file being provided as the second argument.
@@ -1929,6 +1970,7 @@ However, the user may change the selection of genes through 2 flags:
 
  * **-qval <float>**: change qval threshold for selecting genes (default=0.05)
  * **-topk <int>**: select top k genes ranked by significance (qval)
+ * **-low_mean_filter <int>**: filter out genes with grand mean count (across all conditions) below this threshold (even if qval<0.05); default is to exclude genes with mean count<5
 
 Here is an example which generates the following image showing the similarities among
 several different growth conditions:
@@ -1943,7 +1985,7 @@ several different growth conditions:
 
 
 Importantly, the heatmap is based only on the subset of genes
-identified as significantly varying (Padj < 0:05, typically only a few
+identified as *significantly varying* (Padj < 0:05, typically only a few
 hundred genes) in order to enhance the patterns, since otherwise they would
 be washed out by the rest of the genes in the genome, the majority of
 which usually do not exhibit significant variation in counts.



View it on GitLab: https://salsa.debian.org/med-team/tnseq-transit/-/compare/f9a54014a01047da490931368f80c851c54b4d8d...ee0f9a77e0696093ad5f76a2c2ef704fe2d0de8e

-- 
View it on GitLab: https://salsa.debian.org/med-team/tnseq-transit/-/compare/f9a54014a01047da490931368f80c851c54b4d8d...ee0f9a77e0696093ad5f76a2c2ef704fe2d0de8e
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20211012/6f29e3a0/attachment-0001.htm>