[med-svn] [Git][med-team/tnseq-transit][upstream] New upstream version 3.0.1

Michael R. Crusoe gitlab at salsa.debian.org
Fri Nov 8 17:17:49 GMT 2019



Michael R. Crusoe pushed to branch upstream at Debian Med / tnseq-transit


Commits:
cad37841 by Michael R. Crusoe at 2019-11-08T17:06:43Z
New upstream version 3.0.1
- - - - -


12 changed files:

- CHANGELOG.md
- Dockerfile
- README.md
- setup.py
- src/pytpp/__main__.py
- src/pytpp/tpp_tools.py
- src/pytransit/__init__.py
- src/pytransit/__main__.py
- src/pytransit/analysis/gi.py
- src/pytransit/analysis/multitnseq.py (deleted)
- src/pytransit/analysis/multitnseq_helpers.py (deleted)
- src/pytransit/analysis/pathway_enrichment.py


Changes:

=====================================
CHANGELOG.md
=====================================
@@ -1,6 +1,11 @@
 # Change log
 All notable changes to this project will be documented in this file.
 
+## Version 3.0.1 2019-08-01
+#### TRANSIT:
+ - Add check for python3 (TRANSIT 3+ requires python3.6+)
+ - Minor fixes in GI and Pathway Enrichment
+
 ## Version 3.0.0 2019-07-18
 #### TRANSIT:
  - TRANSIT now supports python 3. (To use with python 2, use releases < 3.0.0)


=====================================
Dockerfile
=====================================
@@ -1,8 +1,8 @@
-From r-base:3.4.1
-RUN apt-get update -y && apt-get install -y -f python2 python-dev python-pip
+From r-base:3.6.1
+RUN apt-get update -y && apt-get install -y -f python3 python-dev python3-pip
 ADD src/ /src
 ADD tests/ /tests
-RUN pip install pytest 'numpy~=1.15' 'scipy~=1.2' 'matplotlib~=2.2' 'pillow~=5.0' 'statsmodels~=0.9' 'rpy2<2.9.0'
+RUN pip3 install pytest 'numpy~=1.16' 'scipy~=1.2' 'matplotlib~=3.0' 'pillow~=6.0' 'statsmodels~=0.9' 'rpy2'
 RUN R -e "install.packages('MASS')"
 RUN R -e "install.packages('pscl')"
 

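The `~=` pins in the pip3 line above are "compatible release" specifiers: `numpy~=1.16` allows any 1.x release at or above 1.16 but excludes 2.0. A standalone illustration using the `packaging` library (an assumption; it is not part of this repository):

    from packaging.specifiers import SpecifierSet

    # "~=1.16" is equivalent to ">=1.16, ==1.*"
    spec = SpecifierSet("~=1.16")
    print("1.16.4" in spec)  # True
    print("1.15.0" in spec)  # False
    print("2.0.0" in spec)   # False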

=====================================
README.md
=====================================
@@ -1,8 +1,11 @@
 # TRANSIT
 
 [![Version](https://img.shields.io/github/tag/mad-lab/transit.svg)](https://github.com/mad-lab/transit)   [![Build Status](https://travis-ci.org/mad-lab/transit.svg?branch=master)](https://travis-ci.org/mad-lab/transit)   [![Documentation Status](https://readthedocs.org/projects/transit/badge/?version=latest)](http://transit.readthedocs.io/en/latest/?badge=latest)   [![Downloads](https://pepy.tech/badge/tnseq-transit)](https://pepy.tech/project/tnseq-transit)
+
 =======
 
+**NOTE: TRANSIT v3.0+ now requires python3.6+. If you want to use TRANSIT with python2, use version < 3.0.**
+
 Welcome! This is the distribution for the TRANSIT and TPP tools developed by the [Ioerger Lab](http://orca2.tamu.edu/tom/iLab.html) at Texas A&M University.
 
 TRANSIT is a tool for processing and statistical analysis of Tn-Seq data.
@@ -22,7 +25,7 @@ TRANSIT offers a variety of features including:
 
 -   Ability to analyze datasets from libraries constructed using  **himar1 or tn5 transposons**.
 
--   **TrackView** to help visualize read-counts accross the genome.
+-   **TrackView** to help visualize read-counts across the genome.
 
 -   Can **export datasets** into a variety of formats, including **IGV**.
 


=====================================
setup.py
=====================================
@@ -101,6 +101,7 @@ setup(
 
     description='TRANSIT is a tool for the analysis of Tn-Seq data. It provides an easy to use graphical interface and access to three different analysis methods that allow the user to determine essentiality in a single condition as well as between conditions.',
     long_description=long_description,
+    long_description_content_type='text/markdown',
 
     # The project's main homepage.
     url='https://github.com/mad-lab/transit',

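For reference, `long_description_content_type` is the standard setuptools/PyPI field declaring how the long description should be rendered; without it, PyPI assumes reStructuredText, so a markdown README displays as plain text. A minimal sketch (the real setup.py has many more fields, and the file read is an assumption about layout):

    from setuptools import setup

    # Read the markdown README and declare its content type so PyPI renders it.
    with open("README.md") as f:
        long_description = f.read()

    setup(
        name="tnseq-transit",
        version="3.0.1",
        long_description=long_description,
        long_description_content_type="text/markdown",
    )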

=====================================
src/pytpp/__main__.py
=====================================
@@ -39,6 +39,10 @@ def run_main():
     main(*args, **kwargs)
 
 def main(*args, **kwargs):
+    # Check python version
+    if (sys.version_info[0] < 3):
+        print("TRANSIT v3.0+ requires python3.6+. To use with python2, please install TRANSIT version < 3.0")
+        sys.exit(0)
     vars = Globals()
     # Check for arguments
     if not args and not kwargs and hasWx:        

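The guard above only rejects python2 (and exits with status 0), while the message promises python3.6+. A stricter sketch, not the shipped code, that checks the full requirement and signals failure to the shell:

    import sys

    # Sketch only: rejects anything below 3.6 and exits nonzero on error.
    if sys.version_info < (3, 6):
        print("TRANSIT v3.0+ requires python3.6+. To use with python2, please install TRANSIT version < 3.0")
        sys.exit(1)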

=====================================
src/pytpp/tpp_tools.py
=====================================
@@ -1442,6 +1442,7 @@ def show_help():
   print('    -maxreads <INT>')
   print('    -mismatches <INT>  # when searching for constant regions in reads 1 and 2; default is 1')
   print('    -flags "<STRING>"  # args to pass to BWA')
+  print('    -bwa-alg [aln|mem]  # Default: mem. Algorithm to use for mapping reads with bwa' )
   print('    -primer-start-window INT,INT # position in read to search for start of primer; default is [0,20]')
   print('    -window-size INT   # automatic method to set window')
   print('    -barseq_catalog_in|-barseq_catalog_out <file>')

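A minimal sketch of how a two-choice option like the new `-bwa-alg` can be validated; this is illustrative argparse code, not pytpp's own argument handling:

    import argparse

    # Mirrors the documented semantics: choices aln or mem, default mem.
    parser = argparse.ArgumentParser(prog="tpp-sketch")
    parser.add_argument("-bwa-alg", dest="bwa_alg", choices=["aln", "mem"],
                        default="mem", help="Algorithm to use for mapping reads with bwa")
    args = parser.parse_args()
    print("mapping reads with bwa %s" % args.bwa_alg)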

=====================================
src/pytransit/__init__.py
=====================================
@@ -2,6 +2,6 @@
 __all__ = ["transit_tools", "tnseq_tools", "norm_tools", "stat_tools"]
 
 
-__version__ = "v3.0.0"
+__version__ = "v3.0.1"
 prefix = "[TRANSIT]"
 

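The bumped `__version__` is the string TRANSIT reports at runtime; a quick check, assuming the package is importable:

    import pytransit
    print(pytransit.__version__)  # "v3.0.1" after this commit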

=====================================
src/pytransit/__main__.py
=====================================
@@ -43,6 +43,10 @@ def run_main():
     main(*args, **kwargs)
 
 def main(*args, **kwargs):
+    # Check python version
+    if (sys.version_info[0] < 3):
+        print("TRANSIT v3.0+ requires python3.6+. To use with python2, please install TRANSIT version < 3.0")
+        sys.exit(0)
     #If no arguments, show GUI:
     DEBUG = "--debug" in sys.argv
     if DEBUG:


=====================================
src/pytransit/analysis/gi.py
=====================================
@@ -40,10 +40,10 @@ long_name = "Genetic Interactions"
 short_desc = "Genetic interactions analysis for change in enrichment"
 long_desc = """Method for determining genetic interactions based on changes in enrichment (i.e. delta log fold-change in mean read counts).
 
-NOTE: This method requires 4 groups of datasets. Use the main interface to add datasets for the two strain backgrounds under the first condition. A window will allow you to add the datasets under the second condition.
+NOTE: This method requires 4 groups of datasets. Use the main interface to add datasets for the two strain backgrounds under the first condition. After pressing "Run GI", a subsequent window will allow you to add the datasets under the second condition.
 """
 
-transposons = []
+transposons = ["himar1"]
 columns = ["Orf","Name","Number of TA Sites","Mean count (Strain A Condition 1)","Mean count (Strain A Condition 2)","Mean count (Strain B Condition 1)","Mean count (Strain B Condition 2)", "Mean logFC (Strain A)", "Mean logFC (Strain B)", "Mean delta logFC","Lower Bound delta logFC","Upper Bound delta logFC", "Prob. of delta-logFC being within ROPE", "Adjusted Probability (BFDR)", "Is HDI outside ROPE?", "Type of Interaction"]
 
 


=====================================
src/pytransit/analysis/multitnseq.py deleted
=====================================
@@ -1,572 +0,0 @@
-import sys
-import os
-import io
-
-hasR = False
-try:
-    import rpy2.robjects
-    hasR = True
-except Exception as e:
-    hasR = False
-
-import base
-import numpy
-import scipy
-import pdfkit
-import statsmodels.stats.multitest
-import statsmodels.stats.multicomp
-import pytransit.transit_tools as transit_tools
-import pytransit.tnseq_tools as tnseq_tools
-import pytransit.norm_tools as norm_tools
-from multitnseq_helpers import def_r_samples_corrplot, def_r_clustering, def_r_make_heatmap, def_r_conditions_corrplot
-
-###### GUI ELEMENTS ######
-short_name = "MultiTnSeq"
-long_name = "MultiTnSeq"
-short_desc = "Run the MultiTnSeq pipeline"
-long_desc = """Run the MultiTnSeq pipeline"""
-transposons = ["", ""]
-EOL = "\n"
-
-class MultiTnSeqAnalysis(base.TransitAnalysis):
-    def __init__(self):
-        base.TransitAnalysis.__init__(self, short_name, long_name, short_desc, long_desc, transposons, MultiTnSeqMethod)
-
-class MultiTnSeqMethod(base.AnalysisMethod):
-    """
-    MultiTnSeq pipeline, uses R scripts
-    """
-    def __init__(self, fna, annotation, combined_wig, metadata, cond_metadata, output_file, K, D, debatch, stdnorm):
-        base.AnalysisMethod.__init__(self, short_name, long_name, short_desc, long_desc, output_file, annotation)
-        self.ref = fna
-        self.annotation = annotation
-        self.combined_wig = combined_wig
-        self.samples_metadata_fname = metadata
-        self.conditions_metadata_fname = cond_metadata
-        self.output_file = output_file
-        self.K = K
-        self.D = D
-        self.debatch = False
-        self.stdnorm = False
-        self.rundir = ""
-
-    @classmethod
-    def fromargs(self, rawargs):
-        if not hasR:
-            print("Error: R and rpy2 (< 2.9.0) required to run ZINB analysis.")
-            print("After installing R, you can install rpy2 using the command \"pip install 'rpy2<2.9.0'\"")
-            sys.exit(0)
-
-        (args, kwargs) = transit_tools.cleanargs(rawargs)
-
-        if (kwargs.get('-help', False) or kwargs.get('h', False)):
-            print(MultiTnSeqMethod.usage_string())
-            sys.exit(0)
-
-        fna = args[0]
-        annotation = args[1]
-        combined_wig = args[2]
-        metadata = args[3]
-        cond_metadata = args[4]
-        output_file = args[5]
-        K = kwargs.get("K", 10)
-        D = kwargs.get("D", 5)
-        debatch = kwargs.get("-debatch", False)
-        stdnorm = kwargs.get("-stdnorm", False)
-
-        return self(fna, annotation, combined_wig, metadata, cond_metadata, output_file, K, D, debatch, stdnorm)
-
-    def read_genes(self, fname, descriptions=False):
-      genes = []
-      for line in open(fname):
-        w = line.split('\t')
-        data = [int(w[1])-1,int(w[2])-1,w[8],w[7],w[3]]
-        if descriptions==True: data.append(w[0])
-        genes.append(data)
-      return genes
-
-    def read_genome(self, filename):
-      s = ""
-      n = 0
-      for line in open(filename):
-        if n==0: n = 1 # skip first
-        else: s += line[:-1]
-      return s
-
-    def hash_genes(self, genes,genome):
-      hash = {}
-      for gene in genes:
-        a,b = gene[0],gene[1]
-        for i in range(a,b+1):
-          hash[i] = gene
-      return hash
-
-    def makefname(self, name,dir):
-      if dir=="": return name
-      else: return "%s/%s" % (dir,name)
-
-    def Run(self):
-      print("Running")
-      # self.generate_pdf()
-      # self.transit_message("Done.")
-
-      Samples = tnseq_tools.Spreadsheet(self.makefname(self.samples_metadata_fname, self.rundir))
-      Conditions = tnseq_tools.Spreadsheet(self.makefname(self.conditions_metadata_fname, self.rundir))
-
-      genome = self.read_genome(self.makefname(self.ref,self.rundir))
-      genes = self.read_genes(self.makefname(self.annotation,self.rundir))
-      ghash = self.hash_genes(genes,genome)
-
-      datasets = Samples.getcol("Filename") # filenames
-
-      (sites, data, filenamesInCombWig) = tnseq_tools.read_combined_wig(self.makefname(self.combined_wig, self.rundir)) # data: rows are TA sites, cols are datasets
-      # "datasets" are columns (indexes) in combined wig (data)
-      # SampleIndexes are list of rows in Samples to be analyzed (contains Filenames and Conditions), i.e. which datasets (row indexes in data) represent each condition?
-      # wigindexes are corresponding columns in data (from original order, matched by Filename)
-
-      SampleIndexes,wigindexes = [],[]
-      for cond in Conditions.keys:
-        for row in range(Samples.nrows):
-          if Samples.get(row,"Condition") == cond:
-            fname = Samples.get(row,"Filename")
-            if fname not in filenamesInCombWig:
-              print "error: filename '%s' listed in samples metadata not found in combined wig file" % fname; sys.exit(0)
-            SampleIndexes.append(row)
-            wigindexes.append(filenamesInCombWig.index(fname))
-
-      # reordering data matrix (select columns, order by cond) for normalization; now should be parallel to SampleIndexes
-      data = data[wigindexes, :]
-      data = data[wigindexes, :]
-      neworder = [filenamesInCombWig[i] for i in wigindexes]
-      filenamesInData = neworder
-
-      # foreach cond, list of indexes into data
-      cond2datasets = {}
-      for cond in Conditions.keys:
-        cond2datasets[cond] = []
-        for row in range(Samples.nrows):
-          if Samples.get(row,"Condition")==cond:
-            fname = Samples.get(row,"Filename")
-            cond2datasets[cond].append(filenamesInData.index(fname))
-        if len(cond2datasets[cond]) == 0:
-          print "error: no samples found in metadata for condition %s" % cond; sys.exit(0)
-
-      (normed, factors) = norm_tools.normalize_data(data, method='TTR')
-      Nds,Nsites = normed.shape # transposed: rows are datasets, cols are TA sites
-
-      file = open(self.makefname("temp_counts_TTR.txt", self.rundir),"w")
-      vals = "coord ORF gene".split()+[Samples.get(r,"Id") for r in SampleIndexes] # in order of Conditions
-      file.write('\t'.join(vals)+EOL)
-      for i,co in enumerate(sites):
-        rv,gene = "igr","igr"
-        if co in ghash:
-          annot = ghash[co]
-          rv,gene = annot[2],annot[3]
-        vals = [str(co),rv,gene]+["%0.1f" % (int(x)) for x in list(normed[:,i])]
-        file.write('\t'.join([str(x) for x in vals])+EOL)
-      file.close()
-
-      r_correlation = def_r_samples_corrplot()
-      r_correlation("temp_counts_TTR.txt")
-
-      # write table of stats (saturation,NZmean)
-      file = open(self.makefname("temp_stats.txt", self.rundir),"w")
-      file.write("dataset\tdensity\tmean_ct\tNZmean\tNZmedian\tmax_ct\ttotal_cts\tskewness\tkurtosis\n")
-      for i in range(data.shape[0]):
-        density, meanrd, nzmeanrd, nzmedianrd, maxrd, totalrd, skew, kurtosis = tnseq_tools.get_data_stats(data[i,:])
-        vals = [filenamesInData[i], "%0.2f" % density, "%0.1f" % meanrd, "%0.1f" % nzmeanrd, "%d" % nzmedianrd, maxrd, totalrd, "%0.1f" % skew, "%0.1f" % kurtosis]
-        file.write('\t'.join([str(x) for x in vals])+EOL)
-      file.close()
-
-      ################################
-
-      # new version, hashes on Rv
-
-      siteshash = {}
-      for i,TA in enumerate(sites): siteshash[TA] = i
-
-      TAsites = {}
-      for g,(start,end,Rv,gene,strand) in enumerate(genes):
-        siteindexes = []
-        for i in range(start,end): # end+1?
-          co = i+1
-          if co in siteshash: siteindexes.append(siteshash[co])
-        TAsites[Rv] = siteindexes
-
-
-
-      if False: # print means for each gene for each dataset
-        print '\t'.join(filenamesInData)
-        print '\t'.join([Samples.get(x,"Id") for x in SampleIndexes])
-        for (start,end,Rv,gene,strand) in genes:
-          if len(TAsites[Rv])>0: # skip genes with no TA sites
-            means = []
-            for j in range(Nds):
-              obs = normed[j,TAsites[Rv]]
-              means.append(numpy.mean(obs))
-          print '\t'.join([Rv,gene,str(len(TAsites[Rv]))]+["%0.1f" % x for x in means])
-        #sys.exit(0)
-
-
-      Counts = {} # sub-arrays of (datasets X sites) for genes X conds, where #TAs>0
-      for (start,end,Rv,gene,strand) in genes:
-        siteindexes = TAsites[Rv]
-        local = normed[:,siteindexes]
-        counts = []
-        for j in range(Conditions.nrows):
-          cond = Conditions.get(j,"Condition")
-          obs = local[cond2datasets[cond],:]
-          counts.append(obs) # sub-matrices
-        Counts[Rv] = counts
-
-      Means = {}
-      for (start,end,Rv,gene,strand) in genes:
-       if len(TAsites[Rv])>0: # skip genes with no TA sites
-        means = []
-        for j in range(Conditions.nrows):
-          cond = Conditions.get(j,"Condition")
-          obs = Counts[Rv][j]
-          means.append(numpy.mean(obs))
-        Means[Rv] = means
-
-      file = open(self.makefname("temp_gene_means.txt", self.rundir),"w")
-      file.write('\t'.join("ORF Gene TAs".split()+Conditions.getcol('Condition'))+EOL)
-      for (start,end,Rv,gene,strand) in genes:
-        if Rv not in Means: continue # skip genes with no TA sites
-        means,sites = Means[Rv],TAsites[Rv]
-        file.write('\t'.join([Rv,gene,str(len(sites))]+["%0.1f" % x for x in means])+EOL)
-      file.close()
-
-      ################################
-      # do ANOVA to identify genes with significant variability
-      # it is much faster to do this in python than R (and don't have to create the melted file)
-
-      pvals,Rvs, = [],[] # lists
-      for (start,end,Rv,gene,strand) in genes:
-       if Rv in Means:
-        countsvec = Counts[Rv]
-        countsvec = [x.flatten() for x in countsvec]
-        stat,pval = scipy.stats.f_oneway(*countsvec)
-        pvals.append(pval)
-        Rvs.append(Rv)
-
-      pvals = numpy.array(pvals)
-      mask = numpy.isfinite(pvals)
-      qvals = numpy.full(pvals.shape,numpy.nan)
-      qvals[mask] = statsmodels.stats.multitest.fdrcorrection(pvals[mask])[1] # BH, alpha=0.05
-
-      p,q = {},{}
-      for i,rv in enumerate(Rvs):
-         p[rv],q[rv] = pvals[i],qvals[i]
-      pvals,qvals = p,q, # hashes on Rv
-
-      # post-hoc analysis to identify "high" and "low" subgroups
-      Sortedmeans,Divisions = {},{} # sortedmeans: list of (mean,cond); divisions: list of (lowsubset,highsubset) or None
-      for (start,end,Rv,gene,strand) in genes:
-        if Rv in qvals and qvals[Rv]<0.05:
-          #print Rv,gene
-          countsvec = Counts[Rv]
-          countsvec = [x.flatten() for x in countsvec]
-          allcounts,allconds = numpy.array([]),numpy.array([])
-          for j in range(len(Conditions.keys)):
-            allcounts = numpy.append(allcounts,countsvec[j])
-            for k in range(len(countsvec[j])):
-              allconds = numpy.append(allconds,Conditions.keys[j])
-          mc = statsmodels.stats.multicomp.MultiComparison(allcounts,allconds)
-          tuk = mc.tukeyhsd() # alpha=0.1
-          #print tuk
-          reject,n = {},0
-          groups = tuk.groupsunique.tolist()
-          for j,group1 in enumerate(groups):
-            for k,group2 in enumerate(groups):
-              if j<k: reject[(group1,group2)] = reject[(group2,group1)] = tuk.reject[n]; n += 1
-          sortedmeans = [(numpy.mean(countsvec[k]),Conditions.keys[k]) for k in range(len(countsvec))] # list of (mean,condname)
-          sortedmeans = sorted(sortedmeans)
-          Sortedmeans[Rv] = sortedmeans
-          sortedgroups = [x[1] for x in sortedmeans]
-          #for (m,c) in sortedmeans: print c,"%0.1f" % m
-          candidate_divisions = [] # lists of condition names in 2 subgroups (lower and higher, but not necessarily in that order)
-          for j in range(1,len(sortedmeans)):
-            lowsubset,highsubset = sortedgroups[:j],sortedgroups[j:]
-            alldistinct = True
-            for group1 in lowsubset:
-              for group2 in highsubset: alldistinct = alldistinct and reject[(group1,group2)]; # print group1,group2,reject[(group1,group2)]
-            diff = sortedmeans[j][0]-sortedmeans[j-1][0] # could be negative if conds not sorted
-            vals = lowsubset+['<<<--->>>']+highsubset+["%s" % alldistinct,"%0.1f" % diff]
-            #print '\t'.join(vals)
-            if diff<0: diff,lowsubset,highsubset = -diff,highsubset,lowsubset
-            if alldistinct==True: candidate_divisions.append((diff,lowsubset,highsubset))
-          if len(candidate_divisions)>0:
-            candidate_divisions.sort(reverse=True)
-            (diff,lowsubset,highsubset) = candidate_divisions[0]
-            Divisions[Rv] = (lowsubset,highsubset)
-
-    #The tukeyhsd of statsmodels doesn't return P value.
-    #So, if you want to know P value, calculate from these outputted value or use R.
-    #After saving the output into a variable res, you can get p-values by applying
-    # psturng(np.abs(res.meandiffs / res.std_pairs), len(res.groupsunique), res.df_total), where psturng comes from from statsmodels.stats.libqsturng import psturng
-
-      # save pvals as temp_anova.txt
-      file = open(self.makefname("temp_anova.txt", self.rundir),"w")
-      vals = "Rv Gene TAs".split()+Conditions.keys+"pval padj".split()
-      file.write('\t'.join(vals)+EOL)
-      for (start,end,Rv,gene,strand) in genes:
-       if Rv in Means:
-        m,s = numpy.mean(Means[Rv]),numpy.std(Means[Rv])
-        scv = s/m
-        vals = [Rv,gene,str(len(TAsites[Rv]))]+["%0.1f" % x for x in Means[Rv]]+["%f" % x for x in [pvals[Rv],qvals[Rv]]] # could append SCV, but beware of how multi_TnSeq.py counts hits in temp_anova.txt
-        file.write('\t'.join(vals)+EOL)
-      file.close()
-
-      # write sorted means and high/low subgroups into temp_subgroups.txt
-      file = open(self.makefname("temp_divisions.txt", self.rundir),"w")
-      vals = "Rv Gene TAs".split()+Conditions.keys
-      file.write('\t'.join(vals)+EOL)
-      for (start,end,Rv,gene,strand) in genes:
-       if Rv in qvals and qvals[Rv]<0.05:
-        vals = [Rv,gene,str(len(TAsites[Rv]))]+["%0.1f" % x for x in Means[Rv]]
-        vals += [x[1] for x in Sortedmeans[Rv]]+["%0.1f" % x[0] for x in Sortedmeans[Rv]]
-        if Rv in Divisions:
-          (lowsubset,highsubset) = Divisions[Rv]
-          vals += [','.join(lowsubset),','.join(highsubset)]
-        file.write('\t'.join(vals)+EOL)
-      file.close()
-
-
-      if False: # print gene means for each condition, and batch means
-        for (start,end,Rv,gene,strand) in genes:
-         vals = []
-         if Rv in Means:
-          batches = {}
-          for r in range(Conditions.nrows):
-            batch = Conditions.get(r,"Batch")
-            if batch not in batches: batches[batch] = []
-            batches[batch].append(Means[Rv][r])
-            vals.append(Means[Rv][r]) ###
-          for b in "CC1 CC2 CC3 KO".split(): vals.append(numpy.mean(batches[b]))
-          print '\t'.join([Rv,gene]+["%0.1f" % x for x in vals]) ###
-      ###sys.exit(0)
-
-
-
-      # remove batch effects (subtract batch means from mean gene counts; adjust each to global mean)
-      if self.debatch and "Batch" in Conditions.headers:
-        print "<BR>correcting means for batch effects..."
-        for (start,end,Rv,gene,strand) in genes:
-         if Rv in Means:
-          batches = {}
-          for r in range(Conditions.nrows):
-            batch = Conditions.get(r,"Batch")
-            if batch not in batches: batches[batch] = []
-            batches[batch].append(Means[Rv][r])
-
-          globalmean = numpy.mean(Means[Rv])
-          keys = sorted(batches.keys())
-          batchmeans = {}
-          for b in keys: batchmeans[b] = numpy.mean(batches[b])
-
-          for r in range(Conditions.nrows):
-            batch = Conditions.get(r,"Batch")
-            delta = globalmean-batchmeans[batch] # correction for each batch
-            Means[Rv][r] = max(0,Means[Rv][r]+delta)
-
-      if False: # print gene means for each condition (with batch corrections)
-        vals = ['ORF','gene']+Conditions.getcol("Condition")
-        print '\t'.join(vals)
-        for (start,end,Rv,gene,strand) in genes:
-           if Rv not in Means: continue
-           vals = [Means[Rv][r] for r in range(Conditions.nrows)]
-           print '\t'.join([Rv,gene]+["%0.1f" % x for x in vals])
-        sys.exit(0)
-
-      ################################
-      # calc LFCs
-
-      # consider normalizing by reference conditions?
-      # consider only calculating for non-reference conditions?
-
-      lfcs = {}
-      for (start,end,Rv,gene,strand) in genes:
-       if Rv in qvals and qvals[Rv]<0.05:
-        lfcvec = []
-        for j in range(Conditions.nrows): # could remove means for reference conditions
-          a = Means[Rv][j]
-          b = numpy.median(Means[Rv])
-          PC = 5
-          lfc = numpy.log2((a+5)/float(b+5))
-          lfcvec.append(lfc)
-        lfcs[Rv] = lfcvec
-
-      if self.stdnorm:
-        print "<BR>applying standard normalization to LFCs"
-        for i,cond in enumerate(Conditions.keys):
-          col = [row[i] for row in lfcs.values()]
-          m,s = numpy.mean(col),numpy.std(col) # consider using quantiles(col,[0,0.25,0.5,0.75,1.0])
-          for Rv in lfcs.keys():
-            lfcs[Rv][i] = (lfcs[Rv][i]-m)/s
-
-      file = open(self.makefname("temp_LFCs.txt", self.rundir),"w")
-      vals = "Rv Gene".split()+Conditions.keys # or non-ref-conds
-      file.write('\t'.join(vals)+EOL)
-      cnt = 0
-      for (start,end,Rv,gene,strand) in genes:
-       if Rv in qvals and qvals[Rv]<0.05:
-        vals = [Rv,gene]+["%0.3f" % x for x in lfcs[Rv]]
-        file.write('\t'.join([str(x) for x in vals])+EOL)
-        cnt += 1
-      file.close()
-      if cnt==0: print "error: no significantly varying genes found by ANOVA"; return
-
-      r_conditions_corrplot = def_r_conditions_corrplot()
-      r_make_heatmap = def_r_make_heatmap()
-      r_clustering = def_r_clustering()
-
-      r_conditions_corrplot("temp_LFCs.txt")
-      r_make_heatmap("temp_LFCs.txt")
-      r_clustering("temp_LFCs.txt", self.K, self.D)
-
-      self.generate_pdf()
-      self.transit_message("Done.")
-
-    def to_html(self):
-        stats_table = '<table border="1" cellpadding="5" cellspacing="0">'
-        for line in open("temp_stats.txt"):
-          w = line.rstrip().split('\t')
-          stats_table += "<tr>"
-          for val in w:
-              stats_table += "<td>" + val + "</td>"
-          stats_table += "</tr>"
-        stats_table += '</table>'
-        lfcs_path = os.getcwd() + "/lfcs_boxplot.png"
-
-        corr_plot = '<p><br>lfcs_boxplot.png:<br><img src="{0}"></p>'.format(lfcs_path)
-
-        anova_varying = 0
-        skip = 1
-        for line in open("temp_anova.txt"):
-            # skip header lines
-            if skip > 0: skip -= 1; continue
-            w = line.rstrip().split("\t")
-            if w[-1] != 'nan' and float(w[-1]) < 0.05: anova_varying += 1
-
-        anova_results = '<p>ANOVA finds {0} significantly varying genes'.format(anova_varying)
-        conditions_corrplot_path = os.getcwd() + "/conditions_corrplot.png"
-        conditions_corrplot = '<p><br>conditions_corrplot.png:<br><img src="{0}"></p>'.format(conditions_corrplot_path)
-        lfcs_heatmap = "<p><br>heatmap.png:<BR><img src='{0}'></p>".format(os.getcwd() + "/heatmap.png")
-        cluster_analysis = ''' '''
-
-        return '''
-            <p>
-            Links
-                <ul>
-                    <li> <a href="temp_stats.txt">spreadsheet of statistics on TnSeq datasets</a>
-                    <li> <a href="temp_counts_TTR.txt">spreadsheet of normalized counts</a>
-                    <li> <a href="temp_anova.txt">spreadsheet with gene means and Anova results</a>
-                    <li> <a href="temp_clust.txt">spreadsheet with LFCs and clusters</a>
-                    <li> <a href="temp_divisions.txt">spreadsheet with divisions of conditions into low and high subgroups for significantly varying genes </a>
-                    <li> <a href="temp_scores.txt"> spreadsheet with loadings of genes onto Varimax dimensions, along with normalized associations</a>
-                    <li> <a href="PFilteredValues.txt"> text file with enriched functional categories in each clusters</a>
-                </ul>
-            </p>
-            <hr>
-            <H3>Quality Control</H3>
-            <P>Statistics of TnSeq datasets:<BR>
-            {stats_table}
-            {corr_plot}
-            <hr>
-            {anova_results}
-            <br>
-            Conditions corrplot is based on significantly varying genes
-            {conditions_corrplot}
-            <hr>
-            <h3>Cluster Analysis</h3>
-            <p>LFCs are relative to the median across all conditions</p>
-            {lfcs_heatmap}
-        '''.format(
-                stats_table=stats_table,
-                corr_plot=corr_plot,
-                anova_results=anova_results,
-                conditions_corrplot=conditions_corrplot,
-                lfcs_heatmap=lfcs_heatmap
-            )
-# print "<HR>"
-# print "<H3>Cluster Analysis</H3>"
-
-# print "<P>LFCs are relative to the median across all conditions"
-# newfname = "%s/heatmap.png" % dir
-# print "<BR>heatmap.png:<BR><img src='%s'>" % newfname
-
-# newfname = "%s/cluster_opt.png" % dir
-# print "<P>cluster_opt.png:<BR><img src='%s'>" % newfname
-
-# newfname = "%s/pca_genes.png" % dir
-# print "<P>LD-PCA, colored by Kmeans cluster (K=%s)" % K
-# print "<BR>pca_genes.png:<BR><img src='%s'>" % newfname
-
-# newfname = "%s/hclust_genes.png" % dir
-# print "<P>hclust_genes.png:<BR><img src='%s'>" % newfname
-
-# newfname = "%s/hclust_conditions.png" % dir
-# print "<P>hclust_conditions.png:<BR><img src='%s'>" % newfname
-
-
-# print "<HR>"
-
-# print "<H3>Pathway Enrichment</H3>"
-
-# print """This analysis shows which functional categories are enriched in each cluster.
-# It uses the Sanger annotation (based on Cole's original descriptions of ORF functions,
-# with updates.  Only the categories with <I>adjusted p-values</I> < 0.05 are shown below.
-# P-values were calculated using the Hypergeometric distribution, and
-# the multiple-test correction by the Benjamini-Hochberg procedure
-# was applied to control the FDR at 5%."""
-
-# from functionPathWay import sangerClassification
-# sangerRoles="H37Rv.sanger_roles2"
-# protTable="%s/ref.prot_table" % dir
-# clusters="%s/temp_clust.txt" %dir
-# resultFileName="%s/PFilteredValues.txt" % dir
-# sangerClassification(sangerRoles,protTable,clusters,resultFileName)
-
-# print "<PRE>"
-# for line in open(resultFileName):
-#   print line,
-# print "</PRE>"
-
-# print "<HR>"
-
-# print "<H3>Principle Component Analysis</H3>"
-
-
-# print "<P>Mapping of activators onto Principle Components"
-# newfname = "%s/pca_conditions.png" % dir
-# print "<BR>pca_conditions.png:<BR><img src='%s'>" % newfname
-
-# newfname = "%s/condition_PCs.png" % dir
-# print "<P>condition_PCs.png:<BR><img src='%s'>" % newfname
-
-# newfname = "%s/varimax.png" % dir
-# print "<P>varimax.png:<BR><img src='%s'>" % newfname
-# print "</BODY></HTML>"
-
-    def generate_pdf(self):
-        html_content = self.to_html()
-        file_name = os.path.splitext(os.path.basename(self.output_file))[0]
-        with open("{0}.html".format(file_name), 'w') as f:
-            f.write(html_content)
-        pdfkit.from_string(html_content, "{0}.pdf".format(file_name))
-
-    @classmethod
-    def usage_string(self):
-        return """
-            python %s multitnseq <ref.fna> <ref.prot_table> <combined_wig> <samples_metadata> <conditions_metadata> <output_filename> [-K n] [-D n] [-debatch] [-stdnorm]
-        """ % (sys.argv[0])
-
-def main():
-    print("MultiTnseq example.")
-
-    return self()
-
-if __name__ == "__main__":
-    main()
-


=====================================
src/pytransit/analysis/multitnseq_helpers.py deleted
=====================================
@@ -1,203 +0,0 @@
-from rpy2.robjects import r, globalenv
-
-def def_r_clustering():
-    r('''
-    r_clustering = function(data_filename, K, D) {
-        # input: temp_LFCs.txt
-        data = read.table(data_filename,head=T,sep='\t')
-        lfcs = data[,3:length(colnames(data))]
-        labels = paste(data$Rv,'/',data$Gene,sep="")
-        rownames(lfcs) = labels
-
-        R = length(rownames(lfcs)) # genes
-        C = length(colnames(lfcs)) # conditions
-
-        library(corrplot)
-
-        fname = "lfcs_boxplot.png"
-        png(fname,width=300+15*C,height=500)
-        boxplot(lfcs,las=2,ylab="log2 fold change of genes (relative to the median)")
-        dev.off()
-
-        km = kmeans(lfcs,K)
-        clusters = km$cluster
-
-        write.table(data.frame(lfcs,clust=clusters),"temp_clust.txt", sep='\t', quote=F)
-
-        pca = prcomp(t(lfcs))
-
-        library(MASS)
-        mat = as.matrix(lfcs)
-        ldapca = lda(clusters~.,lfcs)
-        X = mat %*% ldapca$scaling[,1]
-        Y = mat %*% ldapca$scaling[,2]
-        fname = "pca_genes.png"
-        png(fname,width=1000,height=1000)
-        plot(X,Y,pch=20)
-        text(X,Y,label=labels,adj=c(0.5,1.5),col=clusters)
-        dev.off()
-
-        actpca = prcomp(lfcs,center=TRUE,scale=TRUE)
-        fname = "pca_conditions.png"
-        png(fname,width=500,height=500)
-        plot(actpca$rotation[,1:2],pch=20)#,xlim=c(-1,1),ylim=c(-1,1))
-        text(actpca$rotation[,1:2],label=colnames(lfcs),adj=c(0.5,1.5),cex.lab=1.5)
-        dev.off()
-
-        print(length(lfcs[,1]))
-        hc = hclust(dist(lfcs),method="ward.D2")
-        print("****hclust_genes****")
-        print(hc)
-        fname = "hclust_genes.png"
-        png(fname,width=300+15*R,height=600)
-        plot(hc)
-        K2 = max(2,min(K,max(hc$height)))
-        rect.hclust(hc,k=K2,border='red')
-        dev.off()
-
-        hc = hclust(dist(t(lfcs)),method="ward.D2")
-        print("****hclust_conditions****")
-        print(hc)
-        fname = "hclust_conditions.png"
-        png(fname,width=300+15*C,height=600)
-        # plot(hc)
-        # D2 = max(2,min(D,C-1))
-        # rect.hclust(hc,k=D2,border='red')
-        # dev.off()
-
-        fname = "cluster_opt.png"
-        png(fname)
-        library(factoextra)
-        fviz_nbclust(lfcs, kmeans, method = "wss",k.max=30) # or silhouette or gap_stat
-        #fviz_nbclust ( lfcs, kmeans, method = "wss",k.max=30)+geom_hline ( yintercept=52 )+geom_vline(xintercept=10)
-        dev.off()
-
-        ####################################################
-        # required factoextra and corrplot libraries
-
-        var = get_pca_var(actpca) 
-        fname = "condition_PCs.png"
-        png(fname,width=1000,height=1000)
-        corrplot(var$cor,main="Principle Components",mar=c(1,1,1,1))
-        dev.off()
-
-        S = diag(actpca$sdev,D,D)
-        rawLoadings = actpca$rotation[,1:D] %*% S
-        vmax = varimax(rawLoadings)
-        rotatedLoadings = vmax$loadings
-        rotatedScores = scale(actpca$x[,1:D]) %*% vmax$rotmat
-
-        # https://stats.stackexchange.com/questions/59213/how-to-compute-varimax-rotated-principal-components-in-r
-        # scores = U.S = X.V = actpca$x = scale(lfcs) %*% actpca$rotation
-        combined_transform = (actpca$rotation[,1:D] %*% S) %*% vmax$rotmat
-        # vmax$loadings is same as combined_transform, but with abs(vals)<0.1 blanked out
-        write.table(round(combined_transform,6),"varimax_loadings.txt",sep='\t',quote=F)
-
-        fname = "varimax.png"
-        png(fname,width=800,height=800)
-        corrplot(rotatedLoadings,main="Varimax loadings",mar=c(1,1,1,1)) 
-        dev.off()
-
-        scores = rotatedScores
-        colnames(scores) = paste("score",seq(1:D),sep="")
-        squares = scores*scores
-        ss = apply(squares,1,sum)
-        cos2 = squares/ss
-        best = as.matrix(apply(cos2,1,max))
-        assoc = cos2*sign(scores)
-        colnames(assoc) = paste("assoc",seq(1:D),sep="")
-
-        # # generate null distribution for cos2 to closest axis
-        # 
-        # library(MASS)
-        # SAMPLES = 10000
-        # sample = mvrnorm(n=SAMPLES,mu=rep(0,D),Sigma=diag(D))
-        # # no need to apply Varimax rotation
-        # squares = sample*sample
-        # ss = apply(squares,1,sum)
-        # Xcos2 = squares/ss
-        # Xbest = as.matrix(apply(Xcos2,1,max))
-        # #distn = ecdf(best)
-        # pvals = apply(best,1,function (x) { length(Xbest[Xbest>=x]) })
-        # pvals = pvals/SAMPLES
-        # padj = p.adjust(pvals,method="BH")
-         
-        #res = data.frame(data,scores,assoc,best,pvals,padj)
-        res = data.frame(data,scores,assoc)
-        write.table(format(res,digits=3,scientific=F), "temp_scores.txt",sep='\t',quote=F,row.names=F)
-
-    }
-    ''')
-    return globalenv['r_clustering']
-
-def def_r_samples_corrplot():
-    r('''
-    r_samples_corrplot = function(data_filename) {
-        data = read.table(data_filename, sep='\t',head=T)
-        vals = as.matrix(data[,4:length(colnames(data))])
-        N = length(colnames(vals))
-
-        library(corrplot)
-
-        fname = "samples_corrplot.png"
-        png(fname,width=300+20*N,height=300+20*N)
-        #corrplot(cor(vals)) # among TAsites (not good)
-
-        TAsites = aggregate ( coord~ORF,data=data,length)
-        colnames(TAsites) = c("ORF","sites")
-        temp = aggregate(vals,by=list(data$ORF),FUN=mean)
-        corrplot(cor(temp[TAsites$sites>=10,2:length(colnames(temp))]))
-
-        dev.off()
-    }''')
-
-    return globalenv['r_samples_corrplot']
-
-def def_r_conditions_corrplot():
-    r('''
-        r_conditions_corrplot = function(data_filename) {
-            data = read.table(data_filename, sep='\t', head=T)
-            vals = as.matrix(data[,3:length(colnames(data))])
-            N = length(colnames(vals))
-
-            library(corrplot)
-
-            png("conditions_corrplot.png", width=20*N+300, height=20*N+300)
-            corrplot(cor(vals))
-            dev.off()
-        }
-    ''')
-
-    return globalenv['r_conditions_corrplot']
-
-
-def def_r_make_heatmap():
-    r('''
-        r_make_heatmap = function(data_filename) {
-            data = read.table(data_filename, head=T, sep='\t')
-            lfcs = as.matrix(data[,3:length(colnames(data))])
-            labels = paste(data$Rv,'/',data$Gene,sep="")
-            rownames(lfcs) = labels
-
-            library(corrplot)
-            library(RColorBrewer)
-            library(gplots)
-
-            # Red for down regulated, green for upregulated. 
-            redgreen <- function(n) { c( hsv(h=0/6, v=seq(1,0,length=n/2) ), hsv(h=2/6, v=seq(0,1,length=n/2) ), ) }
-            colors <- colorRampPalette(c("red", "white", "green"))(n = 1000)
-
-            C = length(colnames(lfcs))
-            R = length(rownames(lfcs))
-            W = 300+C*30
-            H = 300+R*15
-
-            fname = "heatmap.png"
-            png(fname,width=W,height=H)
-            #par(oma=c(10,1,1,10)) # b,l,t,r; units="lines of space"
-            heatmap.2(lfcs,col=colors,margin=c(12,12),lwid=c(1,7),lhei=c(1,7),trace="none",cexCol=1.4,cexRow=1.4) # make sure white=0
-            dev.off()
-        }
-    ''')
-    return globalenv['r_make_heatmap']
-


=====================================
src/pytransit/analysis/pathway_enrichment.py
=====================================
@@ -177,13 +177,13 @@ class GSEAMethod(base.SingleConditionMethod):
       n = len(dict_n[key])
       I = set(dict_k) & set(dict_n[key])
       k = len(I)
-      result[key] = {"parameters":str(k)+"\t"+str(n),"p-value": hypergeom.sf(k,M,n,N), "Intersection":I}
+      result[key] = {"parameters":str(n)+"\t"+str(k),"p-value": hypergeom.sf(k,M,n,N), "Intersection":I}
     self.padjustForHypergeom(result)
     return result
 
 
   def padjustForHypergeom(self,result):
-    keys = result.keys()
+    keys = list(result.keys())
     pvalues = [result[k]["p-value"] for k in keys]
     b,adj=multitest.fdrcorrection(pvalues, alpha=0.05, method='indep')
     for i in range(len(keys)):
@@ -357,7 +357,7 @@ class GSEAMethod(base.SingleConditionMethod):
     self.output.close()  
 
   def padjustTest(self,test):
-    keys = test.keys()
+    keys = list(test.keys())
     pvals = [test[k][2] for k in keys]
     b,adj=multitest.fdrcorrection(pvals, alpha=0.05, method='indep')
     for i in range(len(keys)):
@@ -453,7 +453,7 @@ class GSEAMethod(base.SingleConditionMethod):
       self.output.write("#%s\n" % "\t".join(columns))
       self.saveExit(gseaVal,PathDict,rank,ORFNameDict,DESCR)
     elif self.M =="HYPE":
-      columns=["cat id descr","Total genes","Total in intersection","pval","padj","genes in intersection"]
+      columns=["cat id", "descr","Total genes","Total in intersection","pval","padj","genes in intersection"]
       self.output.write("#%s\n" % "\t".join(columns))
       self.saveHyperGeometricTest(results,ORFNameDict,DESCR)
     elif self.M =="GSEA-Z":

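Two notes on the hunks above. The `list(result.keys())` changes are needed because `dict.keys()` returns a non-indexable view under python3, and these loops index into `keys`. For the hypergeometric call, scipy's parameterization is `hypergeom.sf(k, M, n, N)` with M = population size, n = category size, N = sample size. A standalone illustration (not code from the patch; the numbers are made up):

    from scipy.stats import hypergeom
    import statsmodels.stats.multitest as multitest

    d = {"cat1": 0.001, "cat2": 0.04, "cat3": 0.7}
    keys = d.keys()        # dict_keys view in python3
    # keys[0]              # would raise TypeError: not subscriptable
    keys = list(d.keys())  # indexable, as the patched loops require

    # fdrcorrection returns (reject mask, BH-adjusted p-values), as used above
    reject, adj = multitest.fdrcorrection([d[k] for k in keys],
                                          alpha=0.05, method='indep')
    for i in range(len(keys)):
        print(keys[i], adj[i])

    # sf gives P(X > k); an inclusive P(X >= k) would use sf(k - 1, M, n, N)
    p = hypergeom.sf(5, 4000, 40, 300)
    print(p)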


View it on GitLab: https://salsa.debian.org/med-team/tnseq-transit/commit/cad37841e08d266124e4f6f971a2e2aad871a44c
