[med-svn] [Git][med-team/tnseq-transit][master] 8 commits: routine-update: New upstream version

Étienne Mollier (@emollier) gitlab at salsa.debian.org
Tue Oct 24 21:07:53 BST 2023



Étienne Mollier pushed to branch master at Debian Med / tnseq-transit


Commits:
44f203c9 by Étienne Mollier at 2023-10-24T21:10:39+02:00
routine-update: New upstream version

- - - - -
b554dbac by Étienne Mollier at 2023-10-24T21:10:41+02:00
New upstream version 3.3.1
- - - - -
09a593b0 by Étienne Mollier at 2023-10-24T21:13:34+02:00
Update upstream source from tag 'upstream/3.3.1'

Update to upstream version '3.3.1'
with Debian dir ec4f27b793442404339451534146808e12fa5247
- - - - -
9d9f5138 by Étienne Mollier at 2023-10-24T21:33:39+02:00
skip_test_requiring_non_existing_input_data.patch: forwarding not needed.

- - - - -
d8396250 by Étienne Mollier at 2023-10-24T21:35:42+02:00
d/clean: cleanup test artifacts.

Closes: #1047512

- - - - -
d1476c64 by Étienne Mollier at 2023-10-24T21:58:34+02:00
d/rules: verify symlink is dangling before move.

tests/H37Rv.fna is a dangling symlink in upstream tarball, and it is
causing test failures so needs to be moved away.  The move causes
subsequent builds to fail, so first check whether the link still
exists, and is still dangling.  As a side effect, the test will come
back by itself when the link is repaired eventually.

Closes: #1049672

- - - - -
161d55af by Étienne Mollier at 2023-10-24T22:03:26+02:00
fix_problematic_comparison.patch: typo in description.

- - - - -
2e42512a by Étienne Mollier at 2023-10-24T22:04:31+02:00
ready to upload to unstable.

- - - - -


26 changed files:

- + .readthedocs.yaml
- CHANGELOG.md
- debian/changelog
- + debian/clean
- debian/patches/fix_problematic_comparison.patch
- debian/patches/skip_test_requiring_non_existing_input_data.patch
- debian/rules
- + src/HMM_conf.py
- + src/allpairs_resampling.py
- src/pytransit/__init__.py
- src/pytransit/analysis/CGI.py
- src/pytransit/analysis/hmm.py
- + src/pytransit/data/iron_combined_wig4.txt
- + src/pytransit/data/iron_samples_metadata.txt
- + src/pytransit/doc/requirements.txt
- + src/pytransit/doc/source/example_scripts.rst
- + src/pytransit/doc/source/file_formats.rst
- src/pytransit/doc/source/index.rst
- src/pytransit/doc/source/method_HMM.rst
- src/pytransit/doc/source/method_anova.rst
- src/pytransit/doc/source/method_resampling.rst
- src/pytransit/doc/source/method_zinb.rst
- src/pytransit/doc/source/transit.rst
- src/pytransit/doc/source/transit_install.rst
- + src/pytransit_resampling.py
- tests/test_analysis_methods.py


Changes:

=====================================
.readthedocs.yaml
=====================================
@@ -0,0 +1,35 @@
+# Read the Docs configuration file for Sphinx projects
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+# Required
+version: 2
+
+# Set the OS, Python version and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.11"
+    # You can also specify other tool versions:
+    # nodejs: "20"
+    # rust: "1.70"
+    # golang: "1.20"
+
+# Build documentation in the "docs/" directory with Sphinx
+sphinx:
+  configuration: src/pytransit/doc/source/conf.py
+  # You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs
+  # builder: "dirhtml"
+  # Fail on all warnings to avoid broken references
+  # fail_on_warning: true
+
+# Optionally build your docs in additional formats such as PDF and ePub
+# formats:
+#    - pdf
+#    - epub
+
+# Optional but recommended, declare the Python requirements required
+# to build your documentation
+# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
+python:
+   install:
+   - requirements: src/pytransit/doc/requirements.txt


=====================================
CHANGELOG.md
=====================================
@@ -3,14 +3,23 @@ All notable changes to this project will be documented in this file.
 
 
 
-## Version 3.3.0 2023-08-03
+## Version 3.3.1 (2023-10-18)
+#### Transit:
+
+Minor changes:
+ - changed HMM output filenames to *.sites.txt and *.genes.txt
+ - added a page on File Formats to documentation
+ - added a page on Example Scripts to documentation (for developers, to illustrate use of pytransit package)
+
+
+## Version 3.3.0 (2023-08-03)
 #### Transit:
 
 Major changes:
  - added CRISPRi-DR method for identifying chemical-genetic interactions (CGI) in CRISPRi libraries
 
 
-## Version 3.2.8 2023-07-22
+## Version 3.2.8 (2023-07-22)
 #### TPP:
 
 Small bug fixes:
@@ -19,7 +28,7 @@ Small bug fixes:
  - added link in documentation to make users aware of Transit2
 
 
-## Version 3.2.7 2022-09-22
+## Version 3.2.7 (2022-09-22)
 #### TRANSIT:
 
 Major changes:


=====================================
debian/changelog
=====================================
@@ -1,3 +1,18 @@
+tnseq-transit (3.3.1-1) unstable; urgency=medium
+
+  * New upstream version 3.3.1
+  * skip_test_requiring_non_existing_input_data.patch: forwarding not needed.
+  * d/clean: cleanup test artifacts. (Closes: #1047512)
+  * d/rules: verify symlink is dangling before move.
+    tests/H37Rv.fna is a dangling symlink in upstream tarball, and it is
+    causing test failures so needs to be moved away.  The move causes
+    subsequent builds to fail, so first check whether the link still
+    exists, and is still dangling.  As a side effect, the test will come
+    back by itself when the link is repaired eventually. (Closes: #1049672)
+  * fix_problematic_comparison.patch: typo in description.
+
+ -- Étienne Mollier <emollier at debian.org>  Tue, 24 Oct 2023 22:03:54 +0200
+
 tnseq-transit (3.3.0-1) unstable; urgency=medium
 
   * New upstream version


=====================================
debian/clean
=====================================
@@ -0,0 +1,4 @@
+tests/data/test-multicontig.fna.*
+tests/test_tpp_temp_*
+tests/testoutput.*
+tpp.cfg


=====================================
debian/patches/fix_problematic_comparison.patch
=====================================
@@ -1,4 +1,4 @@
-Description: Do not use "is" for comparision
+Description: Do not use "is" for comparison
 Author: Nilesh Patra <nilesh at debian.org>
 Bug-Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=984958
 Last-Update: 2021-03-22


=====================================
debian/patches/skip_test_requiring_non_existing_input_data.patch
=====================================
@@ -1,6 +1,7 @@
 Description: Exclude tests that need local data file
 Author: Andreas Tille <tille at debian.org>
 Last-Update: Tue, 18 Dec 2018 16:56:48 +0100
+Forwarded: not-needed
 
 --- a/tests/test_tpp.py
 +++ b/tests/test_tpp.py


=====================================
debian/rules
=====================================
@@ -15,7 +15,9 @@ override_dh_install:
 # Upstream tarball contains a dangling symlink to local data file - hack around this
 override_dh_auto_configure:
 	mkdir -p tests_invalid_data
-	mv tests/H37Rv.fna tests_invalid_data
+	if [ -L tests/H37Rv.fna ] && ! [ -e tests/H37Rv.fna ] \
+	; then mv tests/H37Rv.fna tests_invalid_data \
+	; fi
 	dh_auto_configure
 
 override_dh_auto_test:
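
The shell test above is the crux of the fix: a symlink can be "dangling"
(its target missing) while the link itself still exists.  A minimal Python
sketch of the same check, for illustration only (the packaging itself uses
the shell fragment in debian/rules above; the function name is hypothetical):

  import os

  def move_if_dangling(link_path, dest_dir):
      # os.path.islink() is true for a symlink even when its target is missing;
      # os.path.exists() follows the link, so it is false exactly when the link dangles.
      if os.path.islink(link_path) and not os.path.exists(link_path):
          os.makedirs(dest_dir, exist_ok=True)
          os.rename(link_path, os.path.join(dest_dir, os.path.basename(link_path)))
          return True
      return False

  # e.g. move_if_dangling("tests/H37Rv.fna", "tests_invalid_data")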


=====================================
src/HMM_conf.py
=====================================
@@ -0,0 +1,85 @@
+import sys,numpy
+import scipy.stats
+
+data = []
+all = []
+Sats = {}
+NZmeans = {}
+TAs = {}
+Calls = {}
+for line in open(sys.argv[1]):
+  line = line.strip()
+  if line[0]=='#': continue
+  w = line.split('\t')
+  nTA = int(w[3])
+  if nTA==0: continue
+  votes = [int(x) for x in w[4:8]]
+  m = max(votes)
+  consistency = m/float(nTA)
+  id = w[0]
+  Sats[id] = float(w[-3])
+  NZmeans[id] = float(w[-2])
+  TAs[id] = nTA
+  Calls[id] = w[-1]
+  all.append(consistency)
+  data.append(w)
+
+print("# avg gene-level consistency of HMM states: %s" % (round(numpy.mean(all),4)))
+
+cons = numpy.mean(all)
+
+IDs = [x[0] for x in data]
+
+meanSats,meanNZmeans = {},{}
+stdSats,stdNZmeans = {},{}
+
+print("# state posterior probability distributions:")
+for st in ["ES","GD","NE","GA"]:
+  sub = list(filter(lambda id: Calls[id]==st,IDs))
+  meanSat = numpy.mean([Sats[id] for id in sub])
+  meanNZmean = numpy.mean([NZmeans[id] for id in sub])
+  stdSat = numpy.std([Sats[id] for id in sub])
+  stdNZmean = numpy.std([NZmeans[id] for id in sub])
+  print("#  %s: genes=%s, meanSat=%s, stdSat=%s, meanNZmean=%s, stdNZmean=%s" % (st,len(sub),round(meanSat,3),round(stdSat,3),round(meanNZmean,1),round(stdNZmean,1)))
+  meanSats[st] = meanSat
+  meanNZmeans[st] = meanNZmean
+  stdSats[st] = stdSat
+  stdNZmeans[st] = stdNZmean
+
+def calc_prob(sat,NZmean,meanSat,stdSat,meanNZmean,stdNZmean):
+  a = scipy.stats.norm.pdf(sat,loc=meanSat,scale=stdSat)
+  b = scipy.stats.norm.pdf(NZmean,loc=meanNZmean,scale=stdNZmean)
+  return a*b
+
+# second pass...
+
+for line in open(sys.argv[1]):
+  line = line.strip()
+  if line[0]=='#': print (line); continue
+  w = line.split('\t')
+  id,Call = w[0],w[-1]
+  nTA = int(w[3])
+  if nTA==0: print (line); continue
+  votes = [int(x) for x in w[4:8]]
+  m = max(votes)
+  consistency = m/float(nTA)
+  #prob = scipy.stats.binom.cdf(m,nTA,cons)
+  probs = []
+  STATES = "ES GD NE GA".split()
+  for st in STATES:
+    probs.append(calc_prob(float(w[-3]),float(w[-2]),meanSats[st],stdSats[st],meanNZmeans[st],stdNZmeans[st]))
+  totprob = sum(probs)
+  relprobs = [x/float(totprob) for x in probs]
+  conf = relprobs[STATES.index(Call)]
+  flag=""
+  if conf<0.5: flag="low-confidence"
+  if max(relprobs)<0.7:
+    if (Call=="ES" or Call=="GD") and (relprobs[0]>0.25 and relprobs[1]>0.25): flag="ambiguous"
+    if (Call=="GD" or Call=="NE") and  (relprobs[1]>0.25 and relprobs[2]>0.25) :flag="ambiguous"
+    if (Call=="NE" or Call=="GA") and (relprobs[2]>0.25 and relprobs[3]>0.25):flag="ambiguous"
+  
+  #vals = w+[nTA,m,round(consistency,3),round(prob,4)]
+  vals = w+[round(consistency,3)]+[round(x,6) for x in relprobs]
+  vals += [round(conf,4),flag]
+  print ('\t'.join([str(x) for x in vals]))
+
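
The confidence score printed in the second pass is the called state's share
of a product of two normal densities (saturation and NZmean), mirroring
calc_prob() above.  A tiny self-contained illustration with made-up per-state
parameters (the numbers below are hypothetical, not from any real dataset):

  import scipy.stats

  states = ["ES", "GD", "NE", "GA"]
  # hypothetical (mean, std) of saturation and NZmean for each state
  sat_params = {"ES": (0.05, 0.05), "GD": (0.3, 0.1), "NE": (0.7, 0.15), "GA": (0.9, 0.1)}
  nzm_params = {"ES": (5.0, 5.0), "GD": (20.0, 10.0), "NE": (100.0, 50.0), "GA": (300.0, 100.0)}

  sat, nzmean, call = 0.65, 90.0, "NE"   # one gene's observed values and its HMM call
  probs = [scipy.stats.norm.pdf(sat, *sat_params[s]) *
           scipy.stats.norm.pdf(nzmean, *nzm_params[s]) for s in states]
  relprobs = [p / sum(probs) for p in probs]
  conf = relprobs[states.index(call)]
  print(round(conf, 4), "low-confidence" if conf < 0.5 else "")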


=====================================
src/allpairs_resampling.py
=====================================
@@ -0,0 +1,75 @@
+import sys,numpy
+import pytransit.transit_tools as transit_tools
+import pytransit.tnseq_tools as tnseq_tools
+import pytransit.stat_tools as stat_tools
+import pytransit.norm_tools as norm_tools
+from statsmodels.stats.multitest import fdrcorrection
+
+# this is a stand-alone script that uses the pytransit library to do resampling between all pairs of conditions in a combined_wig file
+# metadata file indicates which replicates belong to which conditions
+# put transit/src/ in your PYTHON_PATH (e.g. using export in bash, or setenv in csh)
+# prints output to stdout
+
+if len(sys.argv)!=4:
+  print("usage: python3 allpairs_resampling.py <combined_wig_file> <metadata_file> <prot_table>")
+  sys.exit(0)
+
+combined_wig_file = sys.argv[1]
+metadata_file = sys.argv[2]
+prot_table = sys.argv[3]
+
+#################################
+
+print("reading data")
+(sites, data, filenamesInCombWig) = tnseq_tools.read_combined_wig(combined_wig_file)
+print("data.shape =",data.shape)
+
+print("normalizing using TTR")
+(data, factors) = norm_tools.normalize_data(data, "TTR")
+
+# there is a tnseq_tools.read_samples_metadata(), but it is kind of complicated
+# so just read through tab-separated file and collect sample Filenames associated with Conditions
+# metadata files have at least 3 columns: Id, Filename, Condition
+# the columns in the combined_wig are cross-referenced by Filename (from the original wigs)
+
+Conditions,Samples = [],{}
+CondCol,FnameCol = -1,-1
+for line in open(metadata_file):
+  w = line.rstrip().split('\t') # tab-separated
+  if CondCol==-1: CondCol,FnameCol = w.index("Condition"),w.index("Filename"); continue # error if headers not found
+  cond,fname = w[CondCol],w[FnameCol]
+  if cond not in Conditions: Conditions.append(cond)
+  if cond not in Samples: Samples[cond] = []
+  Samples[cond].append(fname)
+
+print("\nConditions\t: Samples")
+print("-------------------------")
+for i,cond in enumerate(Conditions):
+  print("%s:%-8s" % (i+1,cond),"\t: ",", ".join(Samples[cond]))
+ 
+#################################
+
+genes = tnseq_tools.Genes(None,prot_table,data=data, position=sites) # divides counts at all TAs sites into groups by orf
+
+print()
+print("running resampling on each pair of conditions...")
+print("reporting number of conditionally essential genes (Qval<0.05)")
+for i,condA in enumerate(Conditions):
+  for j,condB in enumerate(Conditions):
+    if i<j:
+      pvals = []
+      for gene in genes:
+        if gene.n==0: continue # skip genes with 0 TA sites
+        idxA = [filenamesInCombWig.index(s) for s in Samples[condA]]
+        idxB = [filenamesInCombWig.index(s) for s in Samples[condB]]
+        data1 = gene.reads[idxA] # counts at TA sites in gene for wigs for condition A
+        data2 = gene.reads[idxB]
+        stats = stat_tools.resampling(data1, data2, adaptive=True) # use defaults for other params
+        (test_obs, mean1, mean2, log2FC, pval_ltail, pval_utail,  pval_2tail, testlist) = stats # unpack
+        pvals.append(pval_2tail)
+
+      qvals = fdrcorrection(pvals)[1]
+      numhits = sum([q<0.05 for q in qvals])
+      vals = ["%s:%-8s" % (i+1,condA),"%s:%-8s" % (j+1,condB),numhits]
+      print("\t".join([str(x) for x in vals]))
+
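
Beyond pytransit itself, the only statistical dependency here is statsmodels'
Benjamini-Hochberg helper.  A self-contained illustration of the q-value step
that the script applies within each pair of conditions (the p-values are
arbitrary placeholders):

  from statsmodels.stats.multitest import fdrcorrection

  pvals = [0.0004, 0.03, 0.2, 0.8, 0.01]    # placeholder p-values, one per gene
  rejected, qvals = fdrcorrection(pvals)    # Benjamini-Hochberg, default alpha=0.05
  print(sum(q < 0.05 for q in qvals), "hits (qval<0.05)")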


=====================================
src/pytransit/__init__.py
=====================================
@@ -2,6 +2,6 @@
 __all__ = ["transit_tools", "tnseq_tools", "norm_tools", "stat_tools"]
 
 
-__version__ = "v3.3.0"
+__version__ = "v3.3.1"
 prefix = "[TRANSIT]"
 


=====================================
src/pytransit/analysis/CGI.py
=====================================
@@ -125,19 +125,25 @@ class CGI_Method(base.SingleConditionMethod):
         cmd,args,kwargs = self.cmd,self.args,self.kwargs
 
         if cmd=="extract_counts":
-            if len(args)<2: print(self.usage_string())
+            if len(args)<2: 
+                print("You have provided incorrect number of args")
+                print(self.usage_string())
             fastq_file = args[0]
             ids_file = args[1]
             self.extract_counts(fastq_file, ids_file)
         
         elif cmd=="create_combined_counts":
-            if len(args)<2: print(self.usage_string())
+            if len(args)<2: 
+                print("You have provided incorrect number of args")
+                print(self.usage_string())
             headers = args[0].split(",")
             counts_file_list = args[1:]
             self.create_combined_counts(headers,counts_file_list)
 
         elif cmd=="extract_abund":
-            if len(args)<7: print(self.usage_string())
+            if len(args)<7: 
+                print("You have provided incorrect number of args")
+                print(self.usage_string())
             combined_counts_file = args[0]
             metadata_file = args[1]
             control_condition=args[2]
@@ -147,17 +153,23 @@ class CGI_Method(base.SingleConditionMethod):
             days = args[6]
             self.extract_abund(combined_counts_file,metadata_file,control_condition,sgRNA_strength_file,no_dep_abund,drug,days)
         elif cmd == "run_model":
-            if len(args)<1: print(self.usage_string())
+            if len(args)<1: 
+                print("You have provided incorrect number of args")
+                print(self.usage_string())
             ifile_path = args[0] #example frac_abund_RIF_D5.txt
             self.run_model(ifile_path)
         elif cmd == "visualize":
-            if len(args)<3: print(self.usage_string())
+            if len(args)<3: 
+                print("You have provided incorrect number of args")
+                print(self.usage_string())
             frac_abund_file= args[0]
             gene = args[1]
             fig_location = args[2]
             self.visualize(frac_abund_file, gene, fig_location)
             
-        else: print(self.usage_string())
+        else: 
+            print("You have not entered a valid command, here are the options")
+            print(self.usage_string())
 
     def reverse_complement(self, seq):
         complement = {'A':'T','T':'A','C':'G','G':'C'}
@@ -291,15 +303,16 @@ class CGI_Method(base.SingleConditionMethod):
         abund_df["sgRNA"] = abund_df.index.values.tolist()
         abund_df[["orf-gene","remaining"]] = abund_df["sgRNA"].str.split('_',n=1,expand=True)
         abund_df[["orf","gene"]]= abund_df["orf-gene"].str.split(':',expand=True)
-        abund_df = abund_df.drop(columns=["orf-gene","remaining","sgRNA"])
+        abund_df = abund_df.drop(columns=["orf-gene","remaining"])
         abund_df = abund_df.dropna()
         
         abund_df.insert(0, "sgRNA strength", abund_df.pop("sgRNA strength"))
         abund_df.insert(0, "uninduced ATC values", abund_df.pop("uninduced ATC values"))
         abund_df.insert(0, 'gene', abund_df.pop('gene'))
         abund_df.insert(0, 'orf', abund_df.pop('orf'))
+        abund_df.insert(0, 'sgRNA', abund_df.pop('sgRNA'))
 
-        abund_df_text = abund_df.to_csv(sep="\t")
+        abund_df_text = abund_df.to_csv(sep="\t", index=False)
         print(abund_df_text)
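
The pop/insert pattern added above just moves a column to the front of the
DataFrame before writing it out.  A minimal illustration with two toy rows
(only the column names are taken from the code above):

  import pandas as pd

  df = pd.DataFrame({"uninduced ATC values": [1.0, 2.0],
                     "sgRNA strength": [0.3, 0.7],
                     "sgRNA": ["g1_a", "g2_b"]})
  # pop() removes the column and returns it; insert(0, ...) re-adds it as the first column
  df.insert(0, "sgRNA", df.pop("sgRNA"))
  print(df.to_csv(sep="\t", index=False))   # index=False drops the numeric row index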
 
 


=====================================
src/pytransit/analysis/hmm.py
=====================================
@@ -230,12 +230,13 @@ class HMMMethod(base.SingleConditionMethod):
 
         #Get output path
         name = transit_tools.basename(ctrldata[0])
-        defaultFileName = "hmm_output.dat"
+        defaultFileName = "hmm_output_base"
         defaultDir = os.getcwd()
         output_path = wxobj.SaveFile(defaultDir, defaultFileName)
         if not output_path: return None
-        output_file = open(output_path, "w")
-
+        self.sites_fname = output_path+".sites.txt"
+        self.genes_fname = output_path+".genes.txt"
+        output_file = open(self.sites_fname, "w")
 
 
         return self(ctrldata,
@@ -255,7 +256,9 @@ class HMMMethod(base.SingleConditionMethod):
         ctrldata = args[0].split(",")
         annotationPath = args[1]
         outpath = args[2]
-        output_file = open(outpath, "w")
+        self.sites_fname = outpath+".sites.txt"
+        self.genes_fname = outpath+".genes.txt"
+        output_file = open(self.sites_fname, "w")
 
         replicates = kwargs.get("r", "Mean")
         normalization = kwargs.get("n", "TTR")
@@ -409,12 +412,12 @@ class HMMMethod(base.SingleConditionMethod):
 
         self.transit_message("") # Printing empty line to flush stdout 
         self.transit_message("Finished HMM - Sites Method")
-        self.transit_message("Adding File: %s" % (self.output.name))
+        self.transit_message("Adding File: %s" % (self.sites_fname))
         self.add_file(filetype="HMM - Sites")
         
         #Gene Files
         self.transit_message("Creating HMM Genes Level Output")
-        genes_path = ".".join(self.output.name.split(".")[:-1]) + "_genes." + self.output.name.split(".")[-1] 
+        genes_path = self.genes_fname
 
         tempObs = numpy.zeros((1,len(O)))
         tempObs[0,:] = O - 1
@@ -429,7 +432,8 @@ class HMMMethod(base.SingleConditionMethod):
 
     @classmethod
     def usage_string(self):
-        return """python3 %s hmm <comma-separated .wig files> <annotation .prot_table or GFF3> <output file>
+        return """python3 %s hmm <comma-separated .wig files> <annotation .prot_table or GFF3> <output_BASE_filename>
+        (will create 2 output files: BASE.sites.txt and BASE.genes.txt)
 
         Optional Arguments:
             -r <string>     :=  How to handle replicates. Sum, Mean. Default: -r Mean
@@ -556,8 +560,7 @@ class HMMMethod(base.SingleConditionMethod):
 
 
     def post_process_genes(self, data, position, states, output_path):
-
-        output = open(output_path, "w")
+        output = open(self.genes_fname, "w")
         pos2state = dict([(position[t],states[t]) for t in range(len(states))])
         theta = numpy.mean(data > 0)
         G = tnseq_tools.Genes(self.ctrldata, self.annotation_path, data=data, position=position, ignoreCodon=False, nterm=self.NTerminus, cterm=self.CTerminus)
@@ -610,17 +613,10 @@ class HMMMethod(base.SingleConditionMethod):
         output.close()
 
 
-
-    
-
-
-
-
 if __name__ == "__main__":
 
     (args, kwargs) = transit_tools.cleanargs(sys.argv)
 
-
     G = HMMMethod.fromargs(sys.argv[1:])
 
     G.console_message("Printing the member variables:")   


=====================================
src/pytransit/data/iron_combined_wig4.txt
=====================================
The diff for this file was not included because it is too large.

=====================================
src/pytransit/data/iron_samples_metadata.txt
=====================================
@@ -0,0 +1,15 @@
+Id	Filename	Condition
+HighFeMBT	HighFeMBT.wig	HighFeMBT
+HighFeMBT2	HighFeMBT2.wig	HighFeMBT
+HighFeMBT3	HighFeMBT3.wig	HighFeMBT
+LowFeMBT	LowFeMBT.wig	LowFeMBT
+LowFeMBT2	LowFeMBT2.wig	LowFeMBT
+FeCMBT	FeCMBT.wig	FeCMBT
+FeCMBT2	FeCMBT2b.wig	FeCMBT
+Hemin	Hemin.wig	Hemin
+Hemin2	Hemin2b.wig	Hemin
+Hemin3	Hemin3b.wig	Hemin
+Hemoglobin	Hemoglobin.wig	Hemoglobin
+Hemoglobin2	Hemoglobin2b.wig	Hemoglobin
+HeminMBT	HeminMBT.wig	HeminMBT
+HeminMBT2	HeminMBT2.wig	HeminMBT


=====================================
src/pytransit/doc/requirements.txt
=====================================
@@ -0,0 +1,2 @@
+sphinx-rtd-theme
+


=====================================
src/pytransit/doc/source/example_scripts.rst
=====================================
@@ -0,0 +1,244 @@
+Example Scripts (for Developers)
+================================
+
+|
+
+Here are two example python scripts for developers showing how to use the *pytransit* package.
+
+Example 1: src/pytransit_resampling.py
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is a stand-alone script that uses the pytransit library to do
+resampling between two groups of wig files, analogous to running
+"python3 transit.py resampling...".  It is a simplified version of
+src/pytransit/analysis/resampling.py, which has more options.  
+
+To run this, cd into your transit/src/ directory.  It prints output to
+stdout. **Be sure to put transit/src/ in your PYTHON_PATH** 
+(e.g. using export in bash, or setenv in csh).
+
+::
+
+  usage: python3 pytransit_resampling.py <comma_separated_wigs_condA> <comma_separated_wigs_condB> <prot_table>
+
+  > cd transit/src/
+  > python3 pytransit_resampling.py pytransit/data/glycerol_H37Rv_rep1.wig,pytransit/data/glycerol_H37Rv_rep2.wig pytransit/data/cholesterol_H37Rv_rep1.wig,pytransit/data/cholesterol_H37Rv_rep2.wig,pytransit/data/cholesterol_H37Rv_rep3.wig pytransit/genomes/H37Rv.prot_table 
+
+The important functions illustrated in this script are:
+
+ * **transit_tools.get_validated_data()**
+ * **norm_tools.normalize_data()**
+ * **stat_tools.resampling()**
+
+.. code-block:: python
+
+  # transit/src/pytransit_resampling.py
+  import sys,numpy
+  import pytransit.transit_tools as transit_tools
+  import pytransit.tnseq_tools as tnseq_tools
+  import pytransit.stat_tools as stat_tools
+  import pytransit.norm_tools as norm_tools
+  from statsmodels.stats.multitest import fdrcorrection
+  
+  if len(sys.argv)!=4:
+    print("usage: python3 pytransit_resampling.py <comma_separated_wigs_condA> <comma_separated_wigs_condB> <prot_table>")
+    sys.exit(0)
+  
+  wigsA = sys.argv[1].split(',')
+  wigsB = sys.argv[2].split(',')
+  nA = len(wigsA)
+  prot_table = sys.argv[3]
+  
+  (counts,sites) = transit_tools.get_validated_data(wigsA+wigsB) # data is a DxN numpy array, D=num datasets, N=num TA sites
+  print(counts.shape)
+
+  print("normalizing data with TTR")
+  (data, factors) = norm_tools.normalize_data(counts, "TTR")
+
+  # this divides the raw data (counts at all TA sites) into a smaller array of counts for each gene based on its coordinates
+  genes = tnseq_tools.Genes(None,prot_table,data=data, position=sites)
+
+  results,pvals = [],[]
+  for gene in genes:
+    ii = gene.reads # counts at TA sites in this gene
+    if gene.n==0: continue # skip genes with 0 TA sites
+    data1 = gene.reads[:nA] # counts at TA sites in gene for wigs for condition A
+    data2 = gene.reads[nA:]
+    stats = stat_tools.resampling(data1, data2, adaptive=True) # use defaults for other params
+    (test_obs, mean1, mean2, log2FC, pval_ltail, pval_utail,  pval_2tail, testlist) = stats #     unpack
+    vals = [gene.orf,gene.name,gene.n,round(mean1,1),round(mean2,1),round(log2FC,3),pval_2tail]
+    results.append(vals)
+
+  # do multiple-tests correction
+  pvals = [x[-1] for x in results]
+  qvals = fdrcorrection(pvals)[1]
+  
+  print('\t'.join("orf gene numTA meanA meanB LFC Pval Qval".split()))
+  for vals,qval in zip(results,qvals):
+    print('\t'.join([str(x) for x in vals+[qval]]))
+  print("%s/%s hits (qval<0.05)" % (sum([q<0.05 for q in qvals]),len(results)))
+  
+
+
+
+Example 2: src/allpairs_resampling.py
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+While the example above shows how to read in and process individual
+wig files, this example shows how to work with :ref:`combined_wig and
+sample metadata files <combined_wig>`.  It does resampling between
+each pair of conditions, and prints out a matrix of hits
+(statistically-significant conditionally-essential genes).
+
+To run this, cd into your transit/src/ directory.  It prints output to
+stdout. **Be sure to put transit/src/ in your PYTHON_PATH** 
+(e.g. using export in bash, or setenv in csh).
+
+The important parts illustrated in this example script are: 
+
+ * the **tnseq_tools.read_combined_wig()** function
+ * how to read the metadata file
+ * selecting counts from the data matrix for the samples associated with each condition
+
+::
+
+  import sys,numpy
+  import pytransit.transit_tools as transit_tools
+  import pytransit.tnseq_tools as tnseq_tools
+  import pytransit.stat_tools as stat_tools
+  import pytransit.norm_tools as norm_tools
+  from statsmodels.stats.multitest import fdrcorrection
+  
+  # this is a stand-alone script that uses the pytransit library to do resampling between all pairs of conditions in a combined_wig file
+  # metadata file indicates which replicates belong to which conditions
+  # put transit/src/ in your PYTHON_PATH (e.g. using export in bash, or setenv in csh)
+  # prints output to stdout
+  
+  if len(sys.argv)!=4:
+    print("usage: python3 allpairs_resampling.py <combined_wig_file> <metadata_file> <prot_table>")
+    sys.exit(0)
+  
+  combined_wig_file = sys.argv[1]
+  metadata_file = sys.argv[2]
+  prot_table = sys.argv[3]
+  
+  #################################
+  
+  print("reading data")
+  (sites, data, filenamesInCombWig) = tnseq_tools.read_combined_wig(combined_wig_file)
+  print("data.shape =",data.shape)
+  
+  print("normalizing using TTR")
+  (data, factors) = norm_tools.normalize_data(data, "TTR")
+  
+  # there is a tnseq_tools.read_samples_metadata(), but it is kind of complicated
+  # so just read through tab-separated file and collect sample Filenames associated with Conditions
+  # metadata files have at least 3 columns: Id, Filename, Condition
+  # the columns in the combined_wig are cross-referenced by Filename (from the original wigs)
+  
+  Conditions,Samples = [],{}
+  CondCol,FnameCol = -1,-1
+  for line in open(metadata_file):
+    w = line.rstrip().split('\t') # tab-separated
+    if CondCol==-1: CondCol,FnameCol = w.index("Condition"),w.index("Filename"); continue # error if headers not found
+    cond,fname = w[CondCol],w[FnameCol]
+    if cond not in Conditions: Conditions.append(cond)
+    if cond not in Samples: Samples[cond] = []
+    Samples[cond].append(fname)
+  
+  print("\nConditions\t: Samples")
+  print("-------------------------")
+  for i,cond in enumerate(Conditions):
+    print("%s:%-8s" % (i+1,cond),"\t: ",", ".join(Samples[cond]))
+   
+  #################################
+  
+  genes = tnseq_tools.Genes(None,prot_table,data=data, position=sites) # divides counts at all TAs sites into groups by orf
+  
+  print()
+  print("running resampling on each pair of conditions...")
+  print("reporting number of conditionally essential genes (Qval<0.05)")
+  for i,condA in enumerate(Conditions):
+    for j,condB in enumerate(Conditions):
+      if i<j:
+        pvals = []
+        for gene in genes:
+          if gene.n==0: continue # skip genes with 0 TA sites
+          idxA = [filenamesInCombWig.index(s) for s in Samples[condA]]
+          idxB = [filenamesInCombWig.index(s) for s in Samples[condB]]
+          data1 = gene.reads[idxA] # counts at TA sites in gene for wigs for condition A
+          data2 = gene.reads[idxB]
+          stats = stat_tools.resampling(data1, data2, adaptive=True) # use defaults for other params
+          (test_obs, mean1, mean2, log2FC, pval_ltail, pval_utail,  pval_2tail, testlist) = stats # unpack
+          pvals.append(pval_2tail)
+  
+        qvals = fdrcorrection(pvals)[1]
+        numhits = sum([q<0.05 for q in qvals])
+        vals = ["%s:%-8s" % (i+1,condA),"%s:%-8s" % (j+1,condB),numhits]
+        print("\t".join([str(x) for x in vals]))
+  
+
+Here is the output of this script for data from growth
+of M. tuberculosis H37Rv on media containing iron supplied by
+different vehicles (e.g. mycobactin, carboxymycobactin, hemin,
+hemoglobin...), which requires genes in different pathways for uptake
+`(Zhang et al., 2020) <https://pubmed.ncbi.nlm.nih.gov/32069330/>`_.
+The raw data (wig files, with insertion counts at TA sites) have been
+combined into a **combined_wig file** and a **metadata file** that
+describes which samples belong to which condition.  These files 
+(*iron_combined_wig4.txt* and *iron_samples_metadata.txt*) can be found in
+the transit data directory, transit/src/pytransit/data/.  You can
+also compare this to the heatmap shown on the page for :ref:`corrplot
+<corrplot>`.
+
+::
+
+  > cd transit/src/
+  > python3 allpairs_resampling.py pytransit/data/iron_combined_wig4.txt pytransit/data/iron_samples_metadata.txt pytransit/genomes/H37Rv.prot_table
+  reading data
+  data.shape = (14, 74605)
+  normalizing using TTR
+  
+  Conditions	: Samples
+  -------------------------
+  1:HighFeMBT 	:  HighFeMBT.wig, HighFeMBT2.wig, HighFeMBT3.wig
+  2:LowFeMBT 	:  LowFeMBT.wig, LowFeMBT2.wig
+  3:FeCMBT   	:  FeCMBT.wig, FeCMBT2b.wig
+  4:Hemin    	:  Hemin.wig, Hemin2b.wig, Hemin3b.wig
+  5:Hemoglobin 	:  Hemoglobin.wig, Hemoglobin2b.wig
+  6:HeminMBT 	:  HeminMBT.wig, HeminMBT2.wig
+  
+  running resampling on each pair of conditions...
+  reporting number of conditionally essential genes (Qval<0.05)
+  1:HighFeMBT	2:LowFeMBT	38
+  1:HighFeMBT	3:FeCMBT  	10
+  1:HighFeMBT	4:Hemin   	136
+  1:HighFeMBT	5:Hemoglobin	134
+  1:HighFeMBT	6:HeminMBT	30
+  2:LowFeMBT	3:FeCMBT  	35
+  2:LowFeMBT	4:Hemin   	70
+  2:LowFeMBT	5:Hemoglobin	71
+  2:LowFeMBT	6:HeminMBT	5
+  3:FeCMBT  	4:Hemin   	104
+  3:FeCMBT  	5:Hemoglobin	106
+  3:FeCMBT  	6:HeminMBT	44
+  4:Hemin   	5:Hemoglobin	4
+  4:Hemin   	6:HeminMBT	77
+  5:Hemoglobin	6:HeminMBT	89
+
+
+.. NOTE::
+
+  Note, in allpairs_resampling.py, the FDR correction 
+  to adjust P-values for testing all genes in parallel 
+  is applied within each pairwise comparison.
+  This correction is then repeated independently for each pair
+  of conditions analyzed.  Formally, it would be more rigorous to apply
+  the FDR correction one time at the end, to adjust P-values over all pairs and
+  over all genes (which would be G*N*(N-1)/2 probabilities, where G is
+  the number of genes in the genome, and N is the number of conditions).
+  This would likely further reduce the number of significant
+  conditionally-essential genes.  But for simplicity, this example
+  script does not do that, because it is just designed to illustrate
+  using the *pytransit* package.
+
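
To make the note above concrete: the stricter alternative it describes would
pool the p-values from every pairwise comparison and apply a single
Benjamini-Hochberg correction over all of them at the end.  A sketch of that
variant, with random placeholder p-values standing in for the resampling
results (only the condition names are taken from the example above):

  import itertools, random
  from statsmodels.stats.multitest import fdrcorrection

  random.seed(0)
  conditions = ["HighFeMBT", "LowFeMBT", "FeCMBT", "Hemin", "Hemoglobin", "HeminMBT"]
  genes = ["gene%04d" % i for i in range(100)]    # stand-in gene list

  # one p-value per (pair, gene); random placeholders replace stat_tools.resampling()
  records = [(a, b, g, random.random())
             for a, b in itertools.combinations(conditions, 2) for g in genes]

  # single global correction over all G*N*(N-1)/2 p-values
  qvals = fdrcorrection([p for (_, _, _, p) in records])[1]
  hits = [(a, b, g) for (a, b, g, _), q in zip(records, qvals) if q < 0.05]
  print("%d significant (pair, gene) hits out of %d tests" % (len(hits), len(records)))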


=====================================
src/pytransit/doc/source/file_formats.rst
=====================================
@@ -0,0 +1,242 @@
+.. _input_files:
+
+File Formats for Transit
+========================
+
+
+Wig Files
+---------
+
+TPP processes raw sequencing files (.fastq) to extract transposon insertion counts 
+at TA sites.
+These are stored as ".wig" files.
+Wig files primarily have 2 columns of data: a) coordinates of TA sites, 
+and b) the number of insertions observed at those sites.
+By convention, there are also 2 header lines.
+The first header line can be arbitrary info, such as where the data came from.
+The second has to contain "variableStep chrom=" followed by the name of the
+reference genome used for mapping the reads (e.g. H37Rv.fna).
+Here is an example (transit/src/pytransit/data/glycerol_H37Rv_rep2.wig):
+
+::
+  
+  # from Griffin et al, (2011).  PLOS Pathogens, e1002251.
+  variableStep chrom=H37Rv
+  60 0
+  72 0
+  102 0
+  ...
+  1522 124
+  1552 0
+  1635 201
+  1779 0
+  1782 0
+  1788 0
+  1847 0
+  1858 342
+  1921 21
+  ...
+  4411440 99
+  4411479 0
+  4411508 0
+  4411526 0
+
+  (full file has ~75,000 lines)
+
+|
+
+.. _annotation_files:
+
+Prot_Tables (Genome Annotations)
+--------------------------------
+
+The annotation of a genome contains information about genes, such as
+coordinates, strand, locus tag, gene name, and functional description.
+Transit uses a custom format for annotations called "prot_table"s,
+e.g. H37Rv.prot_table.  Prot_tables are **tab-separated text files**
+containing the gene information in 9 specific columns:
+
+**Prot_table file format:**
+
+1. gene function description
+2. start coordinate
+3. end coordinate
+4. strand
+5. length of protein product (in amino acids)
+6. don't care
+7. don't care
+8. gene name (like "dnaA")
+9. ORF id (like Rv0001)
+
+Here is an example (transit/src/pytransit/genomes/H37Rv.prot_table):
+
+::
+
+  chromosomal replication initiation protein 	1	1524	+	507	15607143	885041	dnaA	Rv0001
+  DNA polymerase III subunit beta 	2052	3260	+	402	15607144	887092	dnaN	Rv0002
+  recombination protein F 	3280	4437	+	385	15607145	887089	recF	Rv0003
+  hypothetical protein Rv0004 	4434	4997	+	187	15607146	887088	-	Rv0004
+  DNA gyrase subunit B 	5123	7267	+	714	15607147	887081	gyrB	Rv0005
+  DNA gyrase subunit A 	7302	9818	+	838	15607148	887105	gyrA	Rv0006
+  ... 
+
+  (full file has ~4000 lines)
+  
+
+.. NOTE::
+
+  *It is crucial* that the annotation file (.prot_table) used for
+  analyses in Transit corresponds to exactly the same genome sequence
+  (.fasta or .fna) that was used to generate the .wig files with TPP,
+  because it is used to determine which TA sites are contained in which
+  genes (by coordinates). For example, H37Rv.fna is paired with
+  H37Rv.prot_table, both derived from GenBank sequence NC_000962.3.
+
+
+In many cases, users obtain annotations for their genome
+in **.gff (or .gff3)** file format, such as downloaded from NCBI.  .gff
+files contain essentially the same information about genes.  However,
+there is a bit more flexibility in the .gff file format (especially in
+the tags used in the right-most column), and the information about
+genes is not always encoded in a uniform way, making it difficult to
+use arbitrary .gff files for analyses in Transit.
+Therefore, there is a
+simple procedure in Transit to convert a .gff file to .prot_table
+format (via GUI or command-line).  This
+step only has to be done once, and then the .prot_table can be used
+for all subsequent analyses in Transit.
+(The routine specifically looks for the 'locus_tag', 'gene', and 'product'
+tags in the info field of CDS records.)
+
+::
+
+  > python3 transit.py convert gff_to_prot_table <input.gff_file> <output.prot_table>
+
+
+
+.. _combined_wig:
+
+Combined_wig Files
+------------------
+
+Transit now supports a new file format called 'combined_wig' which basically
+combines multiple wig files into one file (with multiple columns).  This is
+especially useful for managing larger collections of datasets.
+
+Combined_wig files can be created through the Transit GUI
+(in the menu: **File -> Export -> Selected Datasets -> to_Combined_Wig**), or via the command line:
+
+::
+
+  Usage:
+
+  python3 transit.py export combined_wig <comma-separated .wig files> <annotation .prot_table> <output file> [-n <norm>]
+
+  Example:
+
+  > cd src/pytransit/data/
+  > python3 ../../transit.py export combined_wig glycerol_H37Rv_rep1.wig,glycerol_H37Rv_rep2.wig,cholesterol_H37Rv_rep1.wig,cholesterol_H37Rv_rep2.wig,cholesterol_H37Rv_rep3.wig ../genomes/H37Rv.prot_table glyc_chol.combined_wig.txt
+
+
+You can specify the normalization method you want to use with a flag.
+TTR is the default, but other relevant normalization options would be 'nonorm'
+(i.e. preserve raw counts) and 'betageom' (this corrects for skew, but is slow).
+
+The format of a combined_wig is simply a multi-column (tab-separated) file with
+the first column being the coordinates of TA sites, followed by 
+N columns of counts (for N samples), possibly with a final column indicating
+the gene annotation information.
+A combined_wig file can have header lines, prefixed by '#'.
+
+Importantly, a combined_wig file must include sample identifiers
+(filenames) that are prefixed with **"#File: "**.  These header lines
+are automatically included by the 'export combined_wig' Transit
+command that creates a combined_wig from multiple .wig files.  These "#File:"
+header lines should be given in the same order as the sample columns,
+and the filenames are used as identifiers to cross-reference
+information about each sample in the metadata file (see below).
+
+Here is the output file from the example above (glyc_chol.combined_wig.txt):
+
+::
+
+ #Converted to CombinedWig with TRANSIT.
+ #normalization method: TTR
+ #Normalization Factors: 2.286197257382532 1.4371695843583265 2.6365107688705036 3.2329912278198036 2.865270203473558
+ #RefGenome: H37Rv
+ #File: glycerol_H37Rv_rep1.wig
+ #File: glycerol_H37Rv_rep2.wig
+ #File: cholesterol_H37Rv_rep1.wig
+ #File: cholesterol_H37Rv_rep2.wig
+ #File: cholesterol_H37Rv_rep3.wig
+ #TA_coord	glycerol_H37Rv_rep1.wig	glycerol_H37Rv_rep2.wig	cholesterol_H37Rv_rep1.wig	cholesterol_H37Rv_rep2.wig	cholesterol_H37Rv_rep3.wig
+ 60	0.0	0.0	0.0	0.0	0.0	Rv0001 (dnaA)
+ 72	0.0	0.0	0.0	0.0	0.0	Rv0001 (dnaA)
+ ...
+ 1345	0.0	0.0	0.0	0.0	0.0	Rv0001 (dnaA)
+ 1423	0.0	0.0	0.0	0.0	0.0	Rv0001 (dnaA)
+ 1522	0.0	178.2	0.0	0.0	0.0	Rv0001 (dnaA)
+ 1552	61.7	0.0	0.0	29.1	0.0	
+ 1635	32.0	288.9	142.4	22.6	0.0	
+ ...
+
+ (full file has ~75,000 lines)
+
+(DnaA is essential, which is why there are no insertion counts in the first few TA sites)
+
+Note that the ORF id and gene names are appended for TA sites in CDS regions, for convenience.
+If you open a combined_wig file in Excel (with tab as the separator character), you will
+see that the last line of the header (starting with "#TA_coord") provides the input wig filenames as column headers.
+
+The '#RefGenome' is extracted from information in the individual .wig files,
+which records the reference genome sequence that was used with TPP; 
+the coordinates of TA sites are determined from the genome sequence.
+The **reference genome must be the same for all wigs** being combined.
+
+
+|
+
+
+.. _metadata_files:
+
+Samples Metadata File
+---------------------
+
+The **metadata** file describes the sample ids, filenames,
+and conditions they represent (e.g. different media, growth
+conditions, knockout strains, animal passaging, etc., or whatever
+treatments and controls your TnSeq experiment involves) for all the
+samples in a combined_wig file.  
+
+Format of the *samples_metadata* file: a tab-separated file (which you
+can edit in Excel) with 3 columns: Id, Condition, and Filename (it
+must have these headers). The Condition column should be as specific
+as possible, indicating **how to group replicates**.
+You can include other columns of info, but
+do not include additional rows.  Individual rows can be commented out
+by prefixing them with a '#'.  The filenames should match what is
+shown in the header of the combined_wig (including pathnames, if
+present).  An example of a samples metadata file is shown below.
+
+Note: the Condition column should have a unique label for each
+distinct condition (the same label shared only among replicates).  If
+there are attributes that distinguish the conditions (such as strain,
+treatment, etc), they could be included as **additional columns**
+(e.g. covariates, like Carbon_source or Drug or Days or Strain...).
+
+::
+
+  Id      Condition    Filename
+  glyc1   glycerol     /Users/example_data/glycerol_rep1.wig
+  glyc2   glycerol     /Users/example_data/glycerol_rep2.wig
+  chol1   cholesterol  /Users/example_data/cholesterol_rep1.wig
+  chol2   cholesterol  /Users/example_data/cholesterol_rep2.wig
+  chol3   cholesterol  /Users/example_data/cholesterol_rep3.wig
+
+
+
+
+|
+
+.. rst-class:: transit_sectionend
+----
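
For readers who prefer to see the two formats above in code: a compact sketch
that collects the "#File: " identifiers and counts from a combined_wig, and
groups sample filenames by Condition from the metadata.  This is illustrative
only; pytransit's read_combined_wig() and read_samples_metadata() are the
supported way to do this, and the trailing gene-annotation column is simply
ignored here:

  def read_combined_wig(path):
      filenames, sites, counts = [], [], []
      for line in open(path):
          if line.startswith("#File: "):
              filenames.append(line[len("#File: "):].strip())
          elif line.startswith("#") or not line.strip():
              continue
          else:
              w = line.rstrip("\n").split("\t")
              sites.append(int(w[0]))
              counts.append([float(x) for x in w[1:1 + len(filenames)]])
      return filenames, sites, counts

  def read_metadata(path):
      samples, header = {}, None    # samples: Condition -> list of Filenames
      for line in open(path):
          if line.startswith("#"):
              continue
          w = line.rstrip("\n").split("\t")
          if header is None:
              header = w
              continue
          row = dict(zip(header, w))
          samples.setdefault(row["Condition"], []).append(row["Filename"])
      return samples

  # e.g. filenames, sites, counts = read_combined_wig("glyc_chol.combined_wig.txt")
  #      samples = read_metadata("iron_samples_metadata.txt")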


=====================================
src/pytransit/doc/source/index.rst
=====================================
@@ -79,6 +79,7 @@ TRANSIT offers a variety of features including:
    transit_install
    transit_running
    transit_features
+   file_formats
    
 .. toctree::
    :maxdepth: 3
@@ -135,6 +136,7 @@ TRANSIT offers a variety of features including:
    :caption: Code Documentation
 
    transit
+   example_scripts
 
 * :ref:`genindex`
 * :ref:`modindex`


=====================================
src/pytransit/doc/source/method_HMM.rst
=====================================
@@ -20,13 +20,15 @@ How does it work?
 |
 
 
-Example
--------
+Usage
+-----
 
 ::
 
 
-  python3 transit.py hmm <comma-separated .wig files> <annotation .prot_table or GFF3> <output file>
+  > python3 transit.py hmm <comma-separated .wig files> <annotation .prot_table or GFF3> <output_BASE_filename>
+        (will create 2 output files: BASE.sites.txt and BASE.genes.txt)
+
         Optional Arguments:
             -r <string>     :=  How to handle replicates. Sum, Mean. Default: -r Mean
             -l              :=  Perform LOESS Correction; Helps remove possible genomic position bias. Default: Off.
@@ -52,7 +54,7 @@ replicate datasets:
 Output and Diagnostics
 ----------------------
 
-| The HMM method outputs two files. The first file provides the most
+| The HMM method outputs two files. The first file (**BASE.sites.txt**) provides the most
   likely assignment of states for all the TA sites in the genome. Sites
   can belong to one of the following states: "E" (Essential), "GD"
   (Growth-Defect), "NE" (Non-Essential), or "GA" (Growth-Advantage). In
@@ -81,7 +83,7 @@ Output and Diagnostics
 +------------+-----------------------------------------------------------------------------------------------------+
 
 |
-|  The second file provides a gene-level classification for all the
+|  The second file (**BASE.genes.txt**) provides a gene-level classification for all the
   genes in the genome. Genes are classified as "E" (Essential), "GD"
   (Growth-Defect), "NE" (Non-Essential), or "GA" (Growth-Advantage)
   depending on the number of sites within the gene that belong to those
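
A small post-processing sketch for the gene-level file described above:
tallying how many genes received each state call, assuming (as HMM_conf.py
earlier in this push does) that the call is the last tab-separated column and
that comment lines start with '#':

  import sys
  from collections import Counter

  calls = Counter()
  for line in open(sys.argv[1]):            # e.g. BASE.genes.txt
      line = line.strip()
      if not line or line.startswith("#"):
          continue
      calls[line.split("\t")[-1]] += 1      # gene-level state call

  for state, n in sorted(calls.items()):
      print(state, n)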


=====================================
src/pytransit/doc/source/method_anova.rst
=====================================
@@ -46,25 +46,16 @@ Example
 The output file generated by ANOVA identifies which genes exhibit statistically
 significant variability in counts across conditions (see Output and Diagnostics below).
 
-Note: the combined_wig input file can be generated from multiple wig
-files through the Transit GUI
-(File->Export->Selected_Datasets->Combined_wig), or via the 'export'
-command on the command-line (see combined_wig_).
 
-Format of the **samples metadata file**: a tab-separated file (which you can edit in Excel)
-with 3 columns: Id, Condition, and Filename (it must have these headers).  You can include
-other columns of info, but do not include additional rows.  Individual rows can be
-commented out by prefixing them with a '#'.  Here is an example of a samples metadata file:
-The filenames should match what is shown in the header of the combined_wig (including pathnames, if present).
+Input Files
+-----------
 
-::
+Anova takes a **combined_wig file** and **metadata file** as input.
+The combined_wig file contains insertion counts for multiple conditions and replicates.
+The metadata file indicates which replicates go with which experimental conditions.
+
+See description of :ref:`combined_wig and metadata file formats <combined_wig>`.
 
-  ID      Condition    Filename
-  glyc1   glycerol     /Users/example_data/glycerol_rep1.wig
-  glyc2   glycerol     /Users/example_data/glycerol_rep2.wig
-  chol1   cholesterol  /Users/example_data/cholesterol_rep1.wig
-  chol2   cholesterol  /Users/example_data/cholesterol_rep2.wig
-  chol2   cholesterol  /Users/example_data/cholesterol_rep3.wig
 
 
 


=====================================
src/pytransit/doc/source/method_resampling.rst
=====================================
@@ -189,11 +189,12 @@ making it difficult to make confident calls on essentiality.
 Doing resampling with a combined_wig file
 -----------------------------------------
 
-Resampling can also now take a combined_wig_ file as input (containing insertion counts
-for multiple sample), along with a samples_metadata_ file
+Resampling can also now take a **combined_wig** file as input (containing insertion counts
+for multiple samples), along with a **samples_metadata** file
 that describes the samples. This mode is indicated with a '-c' flag.
 If you want to compare more than two conditions, see :ref:`ZINB <zinb>`.
 
+See description of :ref:`combined_wig and metadata file formats <combined_wig>`.
 
 ::
 


=====================================
src/pytransit/doc/source/method_zinb.rst
=====================================
@@ -61,55 +61,16 @@ Example
         --gene <Orf id or Gene name>:= Run method for one gene and print model output.
 
 
-.. _combined_wig:
 
-Combined wig files
-------------------
+Input Files
+-----------
 
-Transit now supports a new file format called 'combined_wig' which basically
-combines multiple wig files into one file (with multiple columns).  This is
-used for some of the new analysis methods for larger collections of datasets, like :ref:`Anova <anova>`, :ref:`ZINB <zinb>`.
-Combined_wig files can created through the Transit GUI
-(File->Export->Selected_Datasets->Combined_wig), or via the command line.
-You can specify the normalization method you want to use with a flag.
-TTR is the default, but other relevant normalization options would be 'nonorm'
-(i.e. preserve raw counts) and 'betageom' (this corrects for skew, but is slow).
+ZINB takes a **combined_wig file** and **metadata file** as input.
+The combined_wig file contains insertion counts for multiple conditions and replicates.
+The metadata file indicates which replicates go with which experimental conditions.
 
+See description of :ref:`combined_wig and metadata file formats <combined_wig>`.
 
-::
-
-  > python3 src/transit.py export combined_wig --help
-
-  usage: python3 src/transit.py export combined_wig <comma-separated .wig files> <annotation .prot_table> <output file>
-
-  > python3 ../transit/src/transit.py export combined_wig Rv_1_H37RvRef.wig,Rv_2_H37RvRef.wig,Rv_3_H37RvRef.wig H37Rv.prot_table clinicals_combined_TTR.wig -n TTR
-
-
-
-.. _samples_metadata:
-
-Samples Metadata File
----------------------
-
-Format of the *samples_metadata* file: a tab-separated file (which you
-can edit in Excel) with 3 columns: Id, Condition, and Filename (it
-must have these headers).  You can include other columns of info, but
-do not include additional rows.  Individual rows can be commented out
-by prefixing them with a '#'.  Here is an example of a samples
-metadata file: The filenames should match what is shown in the header
-of the combined_wig (including pathnames, if present).
-
-Note: the Condition column should have a unique label for each distinct condition (the same label shared only among replicates).
-If there are attributes that distinguish the conditions (such as strain, treatment, etc), they could be included as additional columns (e.g. covariates).
-
-::
-
-  ID      Condition    Filename
-  glyc1   glycerol     /Users/example_data/glycerol_rep1.wig
-  glyc2   glycerol     /Users/example_data/glycerol_rep2.wig
-  chol1   cholesterol  /Users/example_data/cholesterol_rep1.wig
-  chol2   cholesterol  /Users/example_data/cholesterol_rep2.wig
-  chol2   cholesterol  /Users/example_data/cholesterol_rep3.wig
 
 Parameters
 ----------


=====================================
src/pytransit/doc/source/transit.rst
=====================================
@@ -1,4 +1,4 @@
-transit package
+pytransit package
 ===============
 
 |


=====================================
src/pytransit/doc/source/transit_install.rst
=====================================
@@ -246,6 +246,22 @@ wxPython 4+ can be installed using pip
 If the above command fails and you already have wxPython < 4.0 installed, you may have to manually remove it.
 See https://stackoverflow.com/questions/50688630/cannot-uninstall-wxpython-3-0-2-0-macos for details.
 
+.. NOTE::
+
+  Installing *wxPython* can be a bit finicky.  It might require installing the
+  development version of GTK first.  There are at least two versions currently, 
+  *gtk2* and *gtk3*.
+  Transit should work with both, although there can be small differences in the 
+  visual look of the GUI.  To get *wxPython* to install, you might try doing this:
+
+    > sudo apt-get install libgtk2.0-dev
+
+    or
+
+    > sudo apt-get install libgtk-3-dev
+
+  depending on which version of *libgtk* you have installed.
+
 .. _transit-troubleshoot:
 
 Troubleshooting
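
A quick sanity check after installing wxPython (not an official Transit
diagnostic) is to confirm that it imports cleanly in the same Python
environment Transit will run under and that it reports a 4.x build:

  import wx
  print(wx.version())    # expect a 4.x release, e.g. "4.x.y gtk3 (phoenix) ..."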


=====================================
src/pytransit_resampling.py
=====================================
@@ -0,0 +1,48 @@
+import sys,numpy
+import pytransit.transit_tools as transit_tools
+import pytransit.tnseq_tools as tnseq_tools
+import pytransit.stat_tools as stat_tools
+import pytransit.norm_tools as norm_tools
+from statsmodels.stats.multitest import fdrcorrection
+
+# this is a stand-alone script that uses the pytransit library to do resampling between 2 groups of wig files
+# analogous to running "python3 transit.py resampling..."; see also src/pytransit/analysis/resampling.py
+# put transit/src/ in your PYTHON_PATH (e.g. using export in bash, or setenv in csh)
+# prints output to stdout
+
+if len(sys.argv)!=4:
+  print("usage: python3 pytransit_resampling.py <comma_separated_wigs_condA> <comma_separated_wigs_condB> <prot_table>")
+  sys.exit(0)
+
+wigsA = sys.argv[1].split(',')
+wigsB = sys.argv[2].split(',')
+nA = len(wigsA)
+prot_table = sys.argv[3]
+
+(counts,sites) = transit_tools.get_validated_data(wigsA+wigsB) # data is a DxN numpy array, D=num datasets, N=num TA sites
+print("counts.shape =",counts.shape)
+
+print("normalizing data with TTR")
+(data, factors) = norm_tools.normalize_data(counts, "TTR")
+
+# this divides the raw data (counts at all TA sites) into a smaller array of counts for each gene based on its coordinates
+genes = tnseq_tools.Genes(None,prot_table,data=data, position=sites)
+
+results,pvals = [],[]
+for gene in genes:
+  if gene.n==0: continue # skip genes with 0 TA sites
+  data1 = gene.reads[:nA] # counts at TA sites in gene for wigs for condition A
+  data2 = gene.reads[nA:]
+  stats = stat_tools.resampling(data1, data2, adaptive=True) # use defaults for other params
+  (test_obs, mean1, mean2, log2FC, pval_ltail, pval_utail,  pval_2tail, testlist) = stats # unpack
+  vals = [gene.orf,gene.name,gene.n,round(mean1,1),round(mean2,1),round(log2FC,3),pval_2tail]
+  results.append(vals)
+
+# do FDR correction
+pvals = [x[-1] for x in results]
+qvals = fdrcorrection(pvals)[1]
+
+print('\t'.join("orf gene numTA meanA meanB LFC Pval Qval".split()))
+for vals,qval in zip(results,qvals):
+  print('\t'.join([str(x) for x in vals+[qval]]))
+print("%s/%s hits (qval<0.05)" % (sum([q<0.05 for q in qvals]),len(results)))


=====================================
tests/test_analysis_methods.py
=====================================
@@ -63,8 +63,8 @@ class TestMethods(TransitTestCase):
         args = [mini_wig, small_annotation, output]
         G = HMMMethod.fromargs(args)
         G.Run()
-        self.assertTrue(os.path.exists(output))
-        genes_path = output.rsplit(".", 1)[0] + "_genes." + output.rsplit(".", 1)[1]
+        self.assertTrue(os.path.exists(output+".sites.txt"))
+        genes_path = output+".genes.txt"
         self.assertTrue(os.path.exists(genes_path))
 
 



View it on GitLab: https://salsa.debian.org/med-team/tnseq-transit/-/compare/898cc600b276228ce95ff463aa822e85dffdcaa1...2e42512a7f9c064d3c98422b3137515abb87851a
