[med-svn] [Git][med-team/tnseq-transit][master] 6 commits: Fix watchfile to detect new versions on github
Andreas Tille (@tille)
gitlab at salsa.debian.org
Tue Oct 12 15:30:22 BST 2021
Andreas Tille pushed to branch master at Debian Med / tnseq-transit
Commits:
2ff20610 by Andreas Tille at 2021-10-12T16:15:09+02:00
Fix watchfile to detect new versions on github
- - - - -
2d811ab5 by Andreas Tille at 2021-10-12T16:15:40+02:00
routine-update: New upstream version
- - - - -
bd6c5a95 by Andreas Tille at 2021-10-12T16:15:41+02:00
New upstream version 3.2.2
- - - - -
7f1f89a1 by Andreas Tille at 2021-10-12T16:18:43+02:00
Update upstream source from tag 'upstream/3.2.2'
Update to upstream version '3.2.2'
with Debian dir 394017818186cb57bcd30b5cf259cc4b0dd6e7e4
- - - - -
611941a1 by Andreas Tille at 2021-10-12T16:18:43+02:00
routine-update: Standards-Version: 4.6.0
- - - - -
ee0f9a77 by Andreas Tille at 2021-10-12T16:29:34+02:00
routine-update: Ready to upload to unstable
- - - - -
17 changed files:
- CHANGELOG.md
- debian/changelog
- debian/control
- debian/watch
- setup.py
- src/pytransit/__init__.py
- src/pytransit/__main__.py
- src/pytransit/analysis/anova.py
- src/pytransit/analysis/gi.py
- src/pytransit/analysis/heatmap.py
- src/pytransit/analysis/pathway_enrichment.py
- src/pytransit/analysis/tn5gaps.py
- src/pytransit/analysis/zinb.py
- src/pytransit/convert/gff_to_prot_table.py
- src/pytransit/doc/source/conf.py
- src/pytransit/doc/source/index.rst
- src/pytransit/doc/source/transit_methods.rst
Changes:
=====================================
CHANGELOG.md
=====================================
@@ -2,6 +2,18 @@
All notable changes to this project will be documented in this file.
+
+## Version 3.2.2 2022-09-08
+#### TRANSIT:
+ - fixed bug in converting gff_to_prot_table
+ - fixed bug in tn5gaps (fixes some false negative calls)
+ - fixed some bugs in pathway_enrichment (GSEA calculations)
+ - fixed links to Salmonella Tn5 data in docs
+ - fixed problem with margins in heatmap.py that was causing R to fail
+ - added --ref to anova.py and zinb.py (for computing LFCs relative to designated reference condition)
+ - added --low_mean_filter for heatmap.py (for excluding genes with low counts, even if they are significant by ANOVA or ZINB)
+ - add dependency on pypubsub<4.0
+
## Version 3.2.1 2020-12-22
#### TRANSIT:
- maintenance release
=====================================
debian/changelog
=====================================
@@ -1,3 +1,11 @@
+tnseq-transit (3.2.2-1) unstable; urgency=medium
+
+ * Fix watchfile to detect new versions on github
+ * New upstream version
+ * Standards-Version: 4.6.0 (routine-update)
+
+ -- Andreas Tille <tille at debian.org> Tue, 12 Oct 2021 16:19:02 +0200
+
tnseq-transit (3.2.1-2) unstable; urgency=medium
* d/p/fix_problematic_comparison.patch (Closes: #984958)
=====================================
debian/control
=====================================
@@ -14,7 +14,7 @@ Build-Depends: debhelper-compat (= 13),
python3-statsmodels,
python3-pubsub,
bwa
-Standards-Version: 4.5.1
+Standards-Version: 4.6.0
Vcs-Browser: https://salsa.debian.org/med-team/tnseq-transit
Vcs-Git: https://salsa.debian.org/med-team/tnseq-transit.git
Homepage: http://saclab.tamu.edu/essentiality/transit/
=====================================
debian/watch
=====================================
@@ -1,3 +1,3 @@
version=4
-https://github.com/mad-lab/transit/releases .*/archive/v(\d[\d.-]+)\.(?:tar(?:\.gz|\.bz2)?|tgz)
+https://github.com/mad-lab/transit/releases .*/v(\d[\d.-]+)\.(?:tar(?:\.gz|\.bz2)?|tgz)
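Note on the watch change: the old pattern required the tarball link to contain `/archive/v<version>`, which no longer matches if the release page links tarballs under `/archive/refs/tags/`. A minimal Python sketch of the difference follows. It assumes the pattern is applied to each link as an anchored match, and the two hrefs are illustrative examples, not copied from the releases page.

```python
import re

# Shared tail of both watch patterns (tarball extensions).
SUFFIX = r"\.(?:tar(?:\.gz|\.bz2)?|tgz)"
OLD_PATTERN = r".*/archive/v(\d[\d.-]+)" + SUFFIX   # removed line
NEW_PATTERN = r".*/v(\d[\d.-]+)" + SUFFIX           # added line


def upstream_version(pattern, href):
    """Return the captured version string, or None if the href does not match."""
    m = re.fullmatch(pattern, href)
    return m.group(1) if m else None


# Hypothetical hrefs for illustration only.
old_style = "/mad-lab/transit/archive/v3.2.1.tar.gz"
new_style = "/mad-lab/transit/archive/refs/tags/v3.2.2.tar.gz"

for href in (old_style, new_style):
    print(href, upstream_version(OLD_PATTERN, href), upstream_version(NEW_PATTERN, href))
```

The relaxed pattern accepts both link layouts, which is why dropping the `/archive` component is enough here.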
=====================================
setup.py
=====================================
@@ -148,7 +148,9 @@ setup(
# your project is installed. For an analysis of "install_requires" vs pip's
# requirements files see:
# https://packaging.python.org/en/latest/requirements.html
- install_requires=['setuptools', 'numpy~=1.16', 'scipy~=1.2', 'matplotlib~=3.0', 'pillow~=6.0', 'statsmodels~=0.9'],
+ # 'pypubsub<4.0' and 'wxPython' are needed for GUI only, but go ahead and install them
+ # the reason for restriction on pypubsub is that version>=4.0 does not work with python2 - I can probably get rid of this restriction, since everybody must be using python3 by now
+ install_requires=['setuptools', 'numpy~=1.16', 'scipy~=1.2', 'matplotlib~=3.0', 'pillow~=6.0', 'statsmodels~=0.9', 'pypubsub<4.0', 'wxPython'],
#dependency_links = [
# "git+https://github.com/wxWidgets/wxPython.git#egg=wxPython"
=====================================
src/pytransit/__init__.py
=====================================
@@ -2,6 +2,6 @@
__all__ = ["transit_tools", "tnseq_tools", "norm_tools", "stat_tools"]
-__version__ = "v3.2.1"
+__version__ = "v3.2.2"
prefix = "[TRANSIT]"
=====================================
src/pytransit/__main__.py
=====================================
@@ -91,11 +91,9 @@ def main(*args, **kwargs):
# Tried GUI mode but has no wxPython
elif not (args or kwargs) and not hasWx:
- print("Please install wxPython to run in GUI Mode.")
- print("To run in Console Mode please follow these instructions:")
+ print("Please install wxPython to run in GUI Mode. (pip install wxPython)")
print("")
- print("Usage: python %s <method>" % sys.argv[0])
- print("List of known methods:")
+ print("To run in Console Mode, try 'transit <method>' with one of the following methods:")
for m in methods:
print("\t - %s" % m)
# Running in Console mode
=====================================
src/pytransit/analysis/anova.py
=====================================
@@ -34,10 +34,11 @@ class AnovaMethod(base.MultiConditionMethod):
"""
anova
"""
- def __init__(self, combined_wig, metadata, annotation, normalization, output_file, ignored_conditions=[], included_conditions=[], nterm=0.0, cterm=0.0, PC=1):
+ def __init__(self, combined_wig, metadata, annotation, normalization, output_file, ignored_conditions=[], included_conditions=[], nterm=0.0, cterm=0.0, PC=1, refs=[]):
base.MultiConditionMethod.__init__(self, short_name, long_name, short_desc, long_desc, combined_wig, metadata, annotation, output_file,
normalization=normalization, ignored_conditions=ignored_conditions, included_conditions=included_conditions, nterm=nterm, cterm=cterm)
self.PC = PC
+ self.refs = refs
@classmethod
def transit_error(self,msg): print("error: %s" % msg) # for some reason, transit_error() in base class or transit_tools doesn't work right; needs @classmethod
@@ -58,18 +59,20 @@ class AnovaMethod(base.MultiConditionMethod):
NTerminus = float(kwargs.get("iN", 0.0))
CTerminus = float(kwargs.get("iC", 0.0))
PC = int(kwargs.get("PC", 5))
+ refs = kwargs.get("-ref",[]) # list of condition names to use a reference for calculating LFCs
+ if refs!=[]: refs = refs.split(',')
ignored_conditions = list(filter(None, kwargs.get("-ignore-conditions", "").split(",")))
included_conditions = list(filter(None, kwargs.get("-include-conditions", "").split(",")))
# check for unrecognized flags
- flags = "-n --ignore-conditions --include-conditions -iN -iC -PC".split()
+ flags = "-n --ignore-conditions --include-conditions -iN -iC -PC --ref".split()
for arg in rawargs:
if arg[0]=='-' and arg not in flags:
self.transit_error("flag unrecognized: %s" % arg)
print(AnovaMethod.usage_string())
sys.exit(0)
- return self(combined_wig, metadata, annotation, normalization, output_file, ignored_conditions, included_conditions, NTerminus, CTerminus, PC)
+ return self(combined_wig, metadata, annotation, normalization, output_file, ignored_conditions, included_conditions, NTerminus, CTerminus, PC, refs)
def wigs_to_conditions(self, conditionsByFile, filenamesInCombWig):
"""
@@ -171,8 +174,9 @@ class AnovaMethod(base.MultiConditionMethod):
p[rv],q[rv],statusMap[rv] = pvals[i],qvals[i],status[i]
return (p, q, statusMap)
- def calcLFCs(self,means,PC=1):
- grandmean = numpy.mean(means)
+ def calcLFCs(self,means,refs=[],PC=1):
+ if len(refs)==0: refs = means # if ref condition(s) not explicitly defined, use mean of all
+ grandmean = numpy.mean(refs)
lfcs = [math.log((x+PC)/float(grandmean+PC),2) for x in means]
return lfcs
@@ -217,7 +221,8 @@ class AnovaMethod(base.MultiConditionMethod):
Rv = gene["rv"]
if Rv in MeansByRv:
means = [MeansByRv[Rv][c] for c in conditionsList]
- LFCs = self.calcLFCs(means,self.PC)
+ refs = [MeansByRv[Rv][c] for c in self.refs]
+ LFCs = self.calcLFCs(means,refs,self.PC)
vals = ([Rv, gene["gene"], str(len(RvSiteindexesMap[Rv]))] +
["%0.2f" % x for x in means] +
["%0.3f" % x for x in LFCs] +
@@ -234,6 +239,7 @@ class AnovaMethod(base.MultiConditionMethod):
-n <string> := Normalization method. Default: -n TTR
--include-conditions <cond1,...> := Comma-separated list of conditions to use for analysis (Default: all)
--ignore-conditions <cond1,...> := Comma-separated list of conditions to ignore (Default: none)
+ --ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if multiple conditions)
-iN <N> := Ignore TAs within given percentage (e.g. 5) of N terminus. Default: -iN 0
-iC <N> := Ignore TAs within given percentage (e.g. 5) of C terminus. Default: -iC 0
-PC <N> := pseudocounts to use for calculating LFC. Default: -PC 5"""
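Note on `--ref`: the `calcLFCs` change above computes each log-fold-change against the mean of the reference conditions when any are given, and against the grand mean otherwise. A standalone sketch of that arithmetic is below. The function name `calc_lfcs` and the condition names are hypothetical, and `None` replaces the mutable `[]` default used in the diff.

```python
import math


def calc_lfcs(means, ref_means=None, pc=1):
    """Log2 fold change of each condition mean relative to a baseline.

    The baseline is the mean of ref_means if given (non-empty),
    otherwise the grand mean of all condition means.
    pc is the pseudocount added to numerator and denominator.
    """
    baseline_values = ref_means if ref_means else means
    baseline = sum(baseline_values) / len(baseline_values)
    return [math.log2((x + pc) / (baseline + pc)) for x in means]


# Hypothetical per-condition means for one gene.
means_by_condition = {"ctrl": 10.0, "drugA": 20.0, "drugB": 40.0}
conditions = list(means_by_condition)
refs = ["ctrl"]  # what "--ref ctrl" would select

means = [means_by_condition[c] for c in conditions]
ref_means = [means_by_condition[c] for c in refs]
print(calc_lfcs(means, ref_means, pc=5))
```

With a reference given, the reference condition's own LFC is 0 and the other conditions are signed relative to it.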
=====================================
src/pytransit/analysis/gi.py
=====================================
@@ -824,7 +824,7 @@ class GIMethod(base.QuadConditionMethod):
data.sort(key=lambda x: x[probcol])
sortedprobs = numpy.array([x[probcol] for x in data])
- # BFDR method: Newton M.A., Noueiry A., Sarkar D., Ahlquist P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5:155–176.
+ # BFDR method: Newton et al (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5:155-176.
if self.signif=="BFDR":
sortedprobs = numpy.array(sortedprobs)
=====================================
src/pytransit/analysis/heatmap.py
=====================================
@@ -112,6 +112,7 @@ class HeatmapMethod(base.SingleConditionMethod):
self.outfile = args[1]
self.qval = float(kwargs.get("qval",0.05))
self.topk = int(kwargs.get("topk",-1))
+ self.low_mean_filter = int(kwargs.get("low_mean_filter",5)) # filter out genes with grandmean<5 by default
return self(self.infile,outfile=self.outfile)
def Run(self):
@@ -136,14 +137,18 @@ class HeatmapMethod(base.SingleConditionMethod):
headers = headers[3:3+n]
headers = [x.replace("Mean_","") for x in headers]
else:
- lfcs = [float(x) for x in w[3+n:3+n+n]] # take just the columns of means
+ means = [float(x) for x in w[3:3+n]] # take just the columns of means
+ lfcs = [float(x) for x in w[3+n:3+n+n]] # take just the columns of LFCs
qval = float(w[-2])
- data.append((w,lfcs,qval))
+ data.append((w,means,lfcs,qval))
data.sort(key=lambda x: x[-1])
hits,LFCs = [],[]
- for k,(w,lfcs,qval) in enumerate(data):
- if (self.topk==-1 and qval<self.qval) or (self.topk!=-1 and k<self.topk): hits.append(w); LFCs.append(lfcs)
+ for k,(w,means,lfcs,qval) in enumerate(data):
+ if (self.topk==-1 and qval<self.qval) or (self.topk!=-1 and k<self.topk):
+ mm = round(numpy.mean(means),1)
+ if mm<self.low_mean_filter: print("excluding %s/%s, mean(means)=%s" % (w[0],w[1],mm))
+ else: hits.append(w); LFCs.append(lfcs)
print("heatmap based on %s genes" % len(hits))
genenames = ["%s/%s" % (w[0],w[1]) for w in hits]
@@ -168,8 +173,9 @@ H = 300+R*15
png(outfilename,width=W,height=H)
#defaults are lwid=lhei=c(1.5,4)
-heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),lwid=c(2,6),lhei=c(0.1,2),trace="none",cexCol=1.4,cexRow=1.4,key=T) # make sure white=0
-heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),lwid=c(2,6),lhei=c(0.1,2),trace="none",cexCol=1.4,cexRow=1.4,key=T) # make sure white=0
+#heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),lwid=c(2,6),lhei=c(0.1,2),trace="none",cexCol=1.4,cexRow=1.4,key=T) # make sure white=0
+#heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),trace="none",cexCol=1.2,cexRow=1.2,key=T) # make sure white=0 # setting margins was causing failures, so remove it 8/22/21
+heatmap.2(as.matrix(lfcs),col=colors,margin=c(12,12),trace="none",cexCol=1.2,cexRow=1.2,key=T) # actually, margins was OK, so the problem must have been with lhei and lwid
dev.off()
}
''')
@@ -177,7 +183,7 @@ dev.off()
@classmethod
def usage_string(self):
- return "usage: python3 %s heatmap <anova_or_zinb_output> <heatmap.png> -anova|-zinb [-topk <int>] [-qval <float>]\n note: genes are selected based on qval<0.05 by default" % sys.argv[0]
+ return "usage: python3 %s heatmap <anova_or_zinb_output> <heatmap.png> -anova|-zinb [-topk <int>] [-qval <float>] [-low_mean_filter <int>]\n note: genes are selected based on qval<0.05 by default" % sys.argv[0]
if __name__ == "__main__":
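Note on `-low_mean_filter`: in the hunk above, a gene that passes the q-value cutoff is still dropped from the heatmap when the mean of its per-condition means is below the filter value. A simplified sketch of that selection follows. It ignores the `-topk` branch, and the function name `select_hits` and the sample rows are hypothetical.

```python
from statistics import mean


def select_hits(rows, qval_cutoff=0.05, low_mean_filter=5):
    """Keep genes that are significant and not low-count.

    rows: iterable of (gene_id, means, lfcs, qval) tuples.
    Returns a list of (gene_id, lfcs) for the genes that are kept.
    """
    hits = []
    for gene_id, means, lfcs, qval in rows:
        if qval >= qval_cutoff:
            continue  # not significant
        if round(mean(means), 1) < low_mean_filter:
            continue  # significant, but counts too low to plot
        hits.append((gene_id, lfcs))
    return hits


# Hypothetical rows for illustration.
rows = [
    ("geneA", [50.0, 80.0], [-0.3, 0.3], 0.01),  # kept
    ("geneB", [1.0, 3.0], [-0.7, 0.7], 0.01),    # excluded: mean of means is 2.0
    ("geneC", [50.0, 80.0], [-0.3, 0.3], 0.50),  # excluded: q-value too high
]
print([gene_id for gene_id, _ in select_hits(rows)])
```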
=====================================
src/pytransit/analysis/pathway_enrichment.py
=====================================
@@ -160,11 +160,14 @@ Optional parameters:
# assume these are listed as pairs (tab-sep)
# return bidirectional hash (genes->[terms], terms->[genes]; each can be one-to-many, hence lists)
- def read_associations(self,filename):
+ # filter could be a subset of genes we want to focus on (throw out the rest)
+
+ def read_associations(self,filename,filter=None):
associations = {}
for line in open(filename):
if line[0]=='#': continue
w = line.rstrip().split('\t')
+ if filter!=None and w[0] not in filter: continue # skip genes in association file that are not relevant (i.e. not in resampling file)
# store mappings in both directions
for (a,b) in [(w[0],w[1]),(w[1],w[0])]:
if a not in associations: associations[a] = []
@@ -218,8 +221,11 @@ Optional parameters:
def GSEA(self):
data,hits,headers = self.read_resampling_file(self.resamplingFile) # hits are not used in GSEA()
+ orfs_in_resampling_file = [w[0] for w in data]
headers = headers[-1].rstrip().split('\t')
- associations = self.read_associations(self.associationsFile)
+ associations = self.read_associations(self.associationsFile,filter=orfs_in_resampling_file) # bidirectional map; includes term->genelist and gene->termlist
+ # filter: project associations (of orfs to pathways) onto only those orfs appearing in the resampling file
+
ontology = self.read_pathways(self.pathwaysFile)
genenames = {}
for gene in data: genenames[gene[0]] = gene[1]
@@ -235,14 +241,18 @@ Optional parameters:
# rank by SLPV=sign(LFC)*log10(pval)
# note: genes with lowest p-val AND negative LFC have highest scores (like positive correlation)
# there could be lots of ties with pval=0 or 1, but that's OK
- LFC_col = headers.index("log2FC")
- Pval_col = headers.index("p-value")
pairs = [] # pair are: rv and score (SLPV)
for w in data:
- orf,LFC,Pval = w[0],float(w[LFC_col]),float(w[Pval_col])
- SLPV = (-1 if LFC<0 else 1)*math.log(Pval+0.000001,10)
- if self.ranking=="SLPV": pairs.append((orf,SLPV))
- elif self.ranking=="LFC": pairs.append((orf,LFC))
+ orf = w[0]
+ if self.ranking=="SLPV":
+ Pval_col = headers.index("p-value")
+ Pval = float(w[Pval_col])
+ SLPV = (-1 if LFC<0 else 1)*math.log(Pval+0.000001,10)
+ pairs.append((orf,SLPV))
+ elif self.ranking=="LFC":
+ LFC_col = headers.index("log2FC")
+ LFC = float(w[LFC_col])
+ pairs.append((orf,LFC))
# pre-randomize ORFs, to avoid genome-position bias in case of ties in pvals (e.g. 1.0)
indexes = range(len(pairs))
@@ -260,23 +270,25 @@ Optional parameters:
for i,term in enumerate(terms):
sys.stdout.flush()
orfs = terms2orfs.get(term,[])
- if len(orfs)<=1: continue
+ num_genes_in_pathway = len(orfs)
+ if num_genes_in_pathway<2: continue # skip pathways with less than 2 genes
mr = self.mean_rank(orfs,orfs2rank)
es = self.enrichment_score(orfs,orfs2rank,orfs2score,p=self.p) # always positive, even if negative deviation, since I take abs
higher = 0
for n in range(Nperm):
- perm = random.sample(allgenes,len(orfs)) # compare to ES for random sets of genes of same size
- if self.enrichment_score(perm,orfs2rank,orfs2score,p=self.p)>es: higher += 1
- if n>100 and higher>10: break # adaptive
+ perm = random.sample(allgenes,num_genes_in_pathway) # compare to enrichment score for random sets of genes of same size
+ e2 = self.enrichment_score(perm,orfs2rank,orfs2score,p=self.p)
+ if e2>es: higher += 1
+ if n>100 and higher>10: break # adaptive: can stop after seeing 10 events (permutations with higher ES)
pval = higher/float(n)
- vals = ['#',term,len(orfs),mr,es,pval,ontology.get(term,"?")]
+ vals = ['#',term,num_genes_in_pathway,mr,es,pval,ontology.get(term,"?")]
#sys.stderr.write(' '.join([str(x) for x in vals])+'\n')
pctg=(100.0*i)/Total
text = "Running Pathway Enrichment Method... %5.1f%%" % (pctg)
self.progress_update(text, i)
results.append((term,mr,es,pval))
- results.sort(key=lambda x: x[1])
+ results.sort(key=lambda x: x[1]) # sort on mean rank
pvals = [x[-1] for x in results]
rej,qvals = multitest.fdrcorrection(pvals)
results = [tuple(list(res)+[q]) for res,q in zip(results,qvals)]
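Note on the ranking change: SLPV is defined in the comments above as sign(LFC)*log10(pval), so it needs both the LFC and the p-value. As far as this excerpt shows, the new SLPV branch reads only the p-value column before using `LFC`. The sketch below computes both scores with both inputs passed explicitly. The function names and sample rows are hypothetical and this is not the upstream implementation.

```python
import math


def slpv(lfc, pval, eps=1e-6):
    """Signed log p-value: sign(LFC) * log10(pval + eps).

    log10 of a small p-value is negative, so a gene with a low p-value
    and a negative LFC gets a large positive score.
    """
    sign = -1 if lfc < 0 else 1
    return sign * math.log10(pval + eps)


def rank_scores(rows, ranking="SLPV"):
    """rows: iterable of (orf, lfc, pval). Returns (orf, score) pairs."""
    pairs = []
    for orf, lfc, pval in rows:
        if ranking == "SLPV":
            pairs.append((orf, slpv(lfc, pval)))
        elif ranking == "LFC":
            pairs.append((orf, lfc))
        else:
            raise ValueError("unknown ranking: %s" % ranking)
    return pairs


# Hypothetical resampling rows: (orf, log2FC, p-value).
rows = [("orf1", -2.0, 0.001), ("orf2", 1.5, 0.001), ("orf3", 0.1, 1.0)]
print(rank_scores(rows, "SLPV"))
```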
=====================================
src/pytransit/analysis/tn5gaps.py
=====================================
@@ -234,7 +234,7 @@ class Tn5GapsMethod(base.SingleConditionMethod):
self.transit_message("Starting Tn5 gaps method")
start_time = time.time()
- self.transit_message("Getting data (May take a while)")
+ self.transit_message("Loading data (May take a while)")
# Combine all wigs
(data,position) = transit_tools.get_validated_data(self.ctrldata, wxobj=self.wxobj)
@@ -256,7 +256,7 @@ class Tn5GapsMethod(base.SingleConditionMethod):
exp_cutoff = exprunmax + 2*stddevrun
# Get the runs
- self.transit_message("Getting non-insertion runs in genome")
+ self.transit_message("Identifying non-insertion runs in genome")
run_arr = tnseq_tools.runs_w_info(counts)
pos_hash = transit_tools.get_pos_hash(self.annotation_path)
@@ -275,13 +275,13 @@ class Tn5GapsMethod(base.SingleConditionMethod):
count += 1
genes = tnseq_tools.get_genes_in_range(pos_hash, run['start'], run['end'])
for gene_orf in genes:
+ gene = genes_obj[gene_orf] # bug fix: moved this up
start,end = gene.start,gene.end
a,b = self.NTerminus,self.CTerminus
if gene.strand=="-": a,b = b,a
start = start+int((end-start)*(a/100.))
end = end-int((end-start)*(b/100.))
- gene = genes_obj[gene_orf]
inter_sz = self.intersect_size([run['start'], run['end']], [start,end]) + 1
percent_overlap = self.calc_overlap([run['start'], run['end']], [start,end])
run_len = run['length']
@@ -296,9 +296,10 @@ class Tn5GapsMethod(base.SingleConditionMethod):
results_per_gene[gene.orf] = [gene.orf, gene.name, gene.desc, gene.k, gene.n, gene.r, inter_sz, run_len, pval]
# Update Progress
- text = "Running Tn5Gaps method... %1.1f%%" % (100.0*count/N)
- self.progress_update(text, count)
-
+ if count%10000==0:
+ text = "Running Tn5Gaps method... %1.1f%%" % (100.0*count/N)
+ self.progress_update(text, count)
+
data = list(results_per_gene.values())
exp_run_len = float(accum)/N
@@ -331,9 +332,10 @@ class Tn5GapsMethod(base.SingleConditionMethod):
self.output.write("#Essential gene count: %d\n" % (sig_genes_count))
self.output.write("#Minimum reads: %d\n" % (self.minread))
self.output.write("#Replicate combination method: %s\n" % (self.replicates))
+ self.output.write("#Insertion density: %0.3f\n" % (pins))
+ self.output.write("#Mean run length: %0.1f\n" % (exp_run_len))
+ self.output.write("#Expected max run length: %0.1f\n" % (exprunmax))
self.output.write("#Minimum significant run length: %d\n" % (min_sig_len))
- self.output.write("#Expected run length: %1.5f\n" % (exp_run_len))
- self.output.write("#Expected max run length: %s\n" % (exprunmax))
self.output.write("#%s\n" % "\t".join(columns))
#self.output.write("#Orf\tName\tDesc\tk\tn\tr\tovr\tlenovr\tpval\tpadj\tcall\n")
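Note on the tn5gaps fix: the loop reads `gene.start`, `gene.end` and `gene.strand`, so the `gene = genes_obj[gene_orf]` lookup has to come first. Before the move, those lines would have used whatever `gene` was left over from an earlier iteration. The coordinate trimming itself can be sketched as a standalone function. The name `trimmed_coords` is hypothetical, and the arithmetic mirrors the context lines above, including the use of the already-shifted `start` when trimming `end`.

```python
def trimmed_coords(start, end, strand, n_pct, c_pct):
    """Trim a gene's coordinates by percentages at the N and C termini.

    On the minus strand the N terminus is at the high-coordinate end,
    so the two percentages are swapped.
    """
    a, b = n_pct, c_pct
    if strand == "-":
        a, b = b, a
    start = start + int((end - start) * (a / 100.0))
    # As in the diff, this uses the updated start value.
    end = end - int((end - start) * (b / 100.0))
    return start, end


print(trimmed_coords(100, 200, "+", 50, 0))
print(trimmed_coords(100, 200, "-", 50, 0))
```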
=====================================
src/pytransit/analysis/zinb.py
=====================================
@@ -48,7 +48,7 @@ class ZinbMethod(base.MultiConditionMethod):
"""
Zinb
"""
- def __init__(self, combined_wig, metadata, annotation, normalization, output_file, ignored_conditions=[], included_conditions=[], winz=False, nterm=5.0, cterm=5.0, condition="Condition", covars=[], interactions = [], PC=1):
+ def __init__(self, combined_wig, metadata, annotation, normalization, output_file, ignored_conditions=[], included_conditions=[], winz=False, nterm=5.0, cterm=5.0, condition="Condition", covars=[], interactions = [], PC=1, refs=[]):
base.MultiConditionMethod.__init__(self, short_name, long_name, short_desc, long_desc, combined_wig, metadata, annotation, output_file,
normalization=normalization, ignored_conditions=ignored_conditions, included_conditions=included_conditions, nterm=nterm, cterm=cterm)
self.winz = winz
@@ -56,6 +56,7 @@ class ZinbMethod(base.MultiConditionMethod):
self.interactions = interactions
self.condition = condition
self.PC = PC
+ self.refs = refs
@classmethod
def transit_error(self,msg): print("error: %s" % msg) # for some reason, transit_error() in base class or transit_tools doesn't work right; needs @classmethod
@@ -92,19 +93,21 @@ class ZinbMethod(base.MultiConditionMethod):
condition = kwargs.get("-condition", "Condition")
covars = list(filter(None, kwargs.get("-covars", "").split(",")))
interactions = list(filter(None, kwargs.get("-interactions", "").split(",")))
+ refs = kwargs.get("-ref",[]) # list of condition names to use a reference for calculating LFCs
+ if refs!=[]: refs = refs.split(',')
winz = True if "w" in kwargs else False
ignored_conditions = list(filter(None, kwargs.get("-ignore-conditions", "").split(",")))
included_conditions = list(filter(None, kwargs.get("-include-conditions", "").split(",")))
# check for unrecognized flags
- flags = "-n --ignore-conditions --include-conditions -iN -iC -PC --condition --covars --interactions --gene".split()
+ flags = "-n --ignore-conditions --include-conditions -iN -iC -PC --condition --covars --interactions --gene --ref".split()
for arg in rawargs:
if arg[0]=='-' and arg not in flags:
self.transit_error("flag unrecognized: %s" % arg)
print(ZinbMethod.usage_string())
sys.exit(0)
- return self(combined_wig, metadata, annotation, normalization, output_file, ignored_conditions, included_conditions, winz, NTerminus, CTerminus, condition, covars, interactions, PC)
+ return self(combined_wig, metadata, annotation, normalization, output_file, ignored_conditions, included_conditions, winz, NTerminus, CTerminus, condition, covars, interactions, PC, refs)
def wigs_to_conditions(self, conditionsByFile, filenamesInCombWig):
"""
@@ -615,9 +618,10 @@ class ZinbMethod(base.MultiConditionMethod):
Rv = gene["rv"]
means = [statsByRv[Rv]['mean'][group] for group in orderedStatGroupNames]
PC = self.PC
- if len(means)==2: LFCs = [numpy.math.log((means[1]+PC)/(means[0]+PC),2)]
+ if len(means)==2: LFCs = [numpy.math.log((means[1]+PC)/(means[0]+PC),2)] # still need to adapt this to use --ref if defined
else:
- m = numpy.mean(means)
+ if len(self.refs)==0: m = numpy.mean(means) # grand mean across all conditions
+ else: m = numpy.mean([statsByRv[Rv]['mean'][group] for group in self.refs]) ###TRI
LFCs = [numpy.math.log((x+PC)/(m+PC),2) for x in means]
vals = ([Rv, gene["gene"], str(len(RvSiteindexesMap[Rv]))] +
["%0.1f" % statsByRv[Rv]['mean'][group] for group in orderedStatGroupNames] +
@@ -638,6 +642,7 @@ class ZinbMethod(base.MultiConditionMethod):
-n <string> := Normalization method. Default: -n TTR
--ignore-conditions <cond1,cond2> := Comma separated list of conditions to ignore, for the analysis.
--include-conditions <cond1,cond2> := Comma separated list of conditions to include, for the analysis. Conditions not in this list, will be ignored.
+ --ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if multiple conditions)
-iN <float> := Ignore TAs occuring within given percentage (as integer) of the N terminus. Default: -iN 5
-iC <float> := Ignore TAs occuring within given percentage (as integer) of the C terminus. Default: -iC 5
-PC <N> := pseudocounts to use for calculating LFCs. Default: -PC 5
=====================================
src/pytransit/convert/gff_to_prot_table.py
=====================================
@@ -143,7 +143,7 @@ class GffProtMethod(base.ConvertMethod):
for i, line in enumerate(lines):
line = line.strip()
- if line.startswith('#'): continue
+ if len(line)==0 or line.startswith('#'): continue
cols = line.split('\t')
if (len(cols) < 9):
sys.stderr.write(("Ignoring invalid row with entries: {0}\n".format(cols)))
@@ -160,6 +160,7 @@ class GffProtMethod(base.ConvertMethod):
labels[k.strip()] = v.strip()
Rv = labels["locus_tag"].strip() # error out if not found
gene = labels.get('gene', '') # or Name?
+ if gene=="": gene = '-'
desc = labels.get('product', '')
vals = [desc, start, end, strand, size, '-', '-', gene, Rv, '-']
writer.writerow(vals)
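Note on the gff_to_prot_table fix: blank lines are now skipped along with comment lines, and a missing gene name is written as `-` rather than an empty field. A reduced sketch of that per-line handling is below. The attribute splitting on `;` and `=` is an assumption about the GFF column 9 format, since that part of the upstream code is not shown in this diff. The function name `parse_gff_line` and the sample line are hypothetical.

```python
def parse_gff_line(line):
    """Return (locus_tag, gene, product) for a GFF feature line, or None to skip it."""
    line = line.strip()
    if len(line) == 0 or line.startswith("#"):
        return None  # blank line or comment
    cols = line.split("\t")
    if len(cols) < 9:
        return None  # invalid row
    labels = {}
    # Assumed attribute format: key=value pairs separated by ';'.
    for item in cols[8].split(";"):
        if "=" not in item:
            continue
        k, v = item.split("=", 1)
        labels[k.strip()] = v.strip()
    locus_tag = labels["locus_tag"]  # raises KeyError if not found
    gene = labels.get("gene", "")
    if gene == "":
        gene = "-"  # placeholder so the column is never empty
    return locus_tag, gene, labels.get("product", "")


# Hypothetical feature line without a gene attribute.
sample = "\t".join(["chr", "src", "CDS", "1", "1524", ".", "+", "0",
                    "locus_tag=Rv0001;product=example protein"])
print(parse_gff_line(sample))
```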
=====================================
src/pytransit/doc/source/conf.py
=====================================
@@ -34,7 +34,7 @@ extensions = [
'sphinx.ext.todo',
'sphinx.ext.viewcode',
'sphinx.ext.napoleon',
- 'sphinx.ext.mathbase',
+ #'sphinx.ext.mathbase',
'sphinx.ext.mathjax',
]
@@ -144,6 +144,15 @@ html_theme = 'sphinx_rtd_theme'
# documentation.
#html_theme_options = {}
+#html_theme_options = {
+# 'collapse_navigation': False,
+# 'sticky_navigation': True,
+# 'navigation_depth': 4,
+# 'includehidden': True,
+# 'titles_only': False
+#}
+
+
# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
@@ -384,5 +393,6 @@ epub_exclude_files = ['search.html']
# If false, no index is generated.
#epub_use_index = True
def setup(app):
- app.add_stylesheet('css/default.css')
+ #app.add_stylesheet('css/default.css')
+ app.add_css_file('css/default.css')
=====================================
src/pytransit/doc/source/index.rst
=====================================
@@ -8,7 +8,13 @@ Welcome to TRANSIT's documentation!
:target: https://github.com/mad-lab/transit
:alt: GitHub last tag
-This page contains the documentation for TRANSIT. Below are a few quick links to some of the most important sections of the documentation, followed by a brief overview of TRANSIT's features.
+Transit is python-based software for analyzing TnSeq data
+(sequencing data from transposon mutant libraries)
+to determine essentiality of bacterial genes under different conditions.
+
+This page contains the documentation for TRANSIT. Below are a few
+quick links to some of the most important sections of the
+documentation, followed by a brief overview of TRANSIT's features.
Quick Links
~~~~~~~~~~~
@@ -20,7 +26,7 @@ Quick Links
* :ref:`tutorial-link`
* :ref:`tpp-link`
* :ref:`code-link`
-
+* `PDF manual with overview of analysis methods in Transit <https://orca1.tamu.edu/essentiality/transit/transit-manual.pdf>`_
Features
~~~~~~~~
@@ -57,8 +63,8 @@ TRANSIT offers a variety of features including:
.. _manual-link:
.. toctree::
- :maxdepth: 2
- :caption: TRANSIT Manual
+ :maxdepth: 3
+ :caption: TRANSIT MANUAL
transit_overview
transit_install
@@ -70,7 +76,7 @@ TRANSIT offers a variety of features including:
.. _tutorial-link:
.. toctree::
- :maxdepth: 2
+ :maxdepth: 3
:caption: TRANSIT Tutorials
transit_essentiality_tutorial
@@ -83,7 +89,7 @@ TRANSIT offers a variety of features including:
.. _tpp-link:
.. toctree::
- :maxdepth: 2
+ :maxdepth: 3
:caption: TPP Manual
tpp.rst
@@ -91,7 +97,7 @@ TRANSIT offers a variety of features including:
.. _code-link:
.. toctree::
- :maxdepth: 2
+ :maxdepth: 3
:caption: Code Documentation
transit
=====================================
src/pytransit/doc/source/transit_methods.rst
=====================================
@@ -2,12 +2,16 @@
.. _`analysis_methods`:
==================
- Analysis Methods
+ Analysis Methods (Expand Me First!)
==================
-TRANSIT has analysis methods capable of analyzing **Himar1** and **Tn5** datasets.
-Below is a description of some of the methods.
+TRANSIT has analysis methods capable of analyzing **Himar1** and
+**Tn5** datasets. Below is a description of some of the methods.
+
+The analysis methods in Transit are also described in this `PDF manual
+<https://orca1.tamu.edu/essentiality/transit/transit-manual.pdf>`_ .
+
|
@@ -221,7 +225,7 @@ AK. (2009). `Simultaneous assay of every Salmonella Typhi gene using one million
transposon mutants. <http://www.ncbi.nlm.nih.gov/pubmed/19826075>`_ *Genome Res.* , 19(12):2308-16.
This data was downloaded from SRA (located `here <http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP000051>`_) , and used to make
-wig files (`base <http://orca1.tamu.edu/essentiality/transit/data/salmonella_base.wig>`_ and `bile <http://orca1.tamu.edu/essentiality/transit/data/salmonella_bile.wig>`_) and the following 4 baseline datasets
+wig files (`baseline <http://orca1.tamu.edu/essentiality/transit/data/salmonella_baseline.wig>`_ and `bile <http://orca1.tamu.edu/essentiality/transit/data/salmonella_bile.wig>`_) and the following 4 baseline datasets
were merged to make a wig file: (IL2_2122_1,3,6,8). Our analysis
produced 415 genes with adjusted p-values less than 0.05, indicating
essentiality, and the analysis from the above paper produced 356
@@ -530,7 +534,7 @@ parameters are available for the method:
calculated will have, at the expense of longer computation time. The
resampling method runs on 10,000 samples by default.
-- **Output Histograms:**\ Determines whether to output .png images of
+- **Output Histograms:** Determines whether to output .png images of
the histograms obtained from resampling the difference in
read-counts.
@@ -552,7 +556,7 @@ parameters are available for the method:
as real differences. See the :ref:`Normalization <normalization>` section for a description
of normalization method available in TRANSIT.
-- **--ctrl_lib, --exp_lib:** These are for doing resampling with datasets from multiple libraries, see below.
+- **\-\-ctrl_lib, \-\-exp_lib:** These are for doing resampling with datasets from multiple libraries, see below.
- **-iN, -iC:** Trimming of TA sites near N- and C-terminus.
The default for trimming TA sites in the termini of ORFs is 0.
@@ -1051,6 +1055,7 @@ Example
-n <string> := Normalization method. Default: -n TTR
--ignore-conditions <cond1,...> := Comma separated list of conditions to ignore, for the analysis. Default: None
--include-conditions <cond1,...> := Comma separated list of conditions to include, for the analysis. Default: All
+ --ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if multiple conditions)
-iN <float> := Ignore TAs occurring within given percentage (as integer) of the N terminus. Default: -iN 0
-iC <float> := Ignore TAs occurring within given percentage (as integer) of the C terminus. Default: -iC 0
-PC := Pseudocounts to use in calculating LFCs. Default: -PC 5
@@ -1083,12 +1088,22 @@ The filenames should match what is shown in the header of the combined_wig (incl
Parameters
----------
-The following parameters are available for the method:
+The following parameters are available for the ANOVA method:
-- **Ignore Conditions, Include Conditions:** Can use this to drop
- conditions not of interest or specify a particular subset of conditions to use for ANOVA analysis.
+- **\-\-include-conditions:** Includes the given set of conditions from the ZINB test. Conditions not in this list are ignored. Note: this is useful for specifying the order in which the columns are listed in the output file.
-- **Normalization Method:** Determines which normalization method to
+- **\-\-ignore-conditions:** Can use this to drop conditions not of interest.
+
+- **\-\-ref:** Specify which condition to use as a reference for computing LFCs.
+ By default, LFCs for each gene in each condition are calculated with respect
+ to the *grand mean* count across all conditions (so conditions with higher counts will be balanced
+ with conditions with lower counts). However, if there is a defined reference condition
+ in the data, it may be specified using **\-\-ref** (in which case LFCs for that condition will
+ be around 0, and will be positive or negative for the other conditions, depending on whether
+ counts are higher or lower than the reference condition). If there is more than one
+ condition to use as reference (i.e. pooled), they may be given as a comma-separated list.
+
+- **-n** Normalization Method. Determines which normalization method to
use when comparing datasets. Proper normalization is important as it
ensures that other sources of variability are not mistakenly treated
as real differences. See the :ref:`Normalization <normalization>` section for a description
@@ -1096,6 +1111,8 @@ The following parameters are available for the method:
- **-PC** Pseudocounts to use in calculating LFCs (see below). Default: -PC 5
+
+
Output and Diagnostics
----------------------
@@ -1203,6 +1220,7 @@ Example
-n <string> := Normalization method. Default: -n TTR
--ignore-conditions <cond1,cond2> := Comma separated list of conditions to ignore, for the analysis. Default: None
--include-conditions <cond1,cond2> := Comma separated list of conditions to include, for the analysis. Default: All
+ --ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if more than one)
-iN <float> := Ignore TAs occurring within given percentage of the N terminus. Default: -iN 5
-iC <float> := Ignore TAs occurring within given percentage of the C terminus. Default: -iC 5
-PC <N> := Pseudocounts used in calculating LFCs in output file. Default: -PC 5
@@ -1265,8 +1283,9 @@ Parameters
The following parameters are available for the method:
-- **Ignore Conditions:** Ignores the given set of conditions from the ZINB test.
-- **Include Conditions:** Includes the given set of conditions from the ZINB test. Conditions not in this list are ignored.
+- **\-\-include-conditions:** Includes the given set of conditions from the ZINB test. Conditions not in this list are ignored. Note: this is useful for specifying the order in which the columns are listed in the output file.
+- **\-\-ignore-conditions:** Ignores the given set of conditions from the ZINB test.
+- **\-\-ref:** which condition to use as a reference when computing LFCs in the output file
- **Normalization Method:** Determines which normalization method to
use when comparing datasets. Proper normalization is important as it
ensures that other sources of variability are not mistakenly treated
@@ -1296,7 +1315,7 @@ strains (think: different 'slopes'). In such a case, we would say strain and tim
If covariates distinguishing the samples are available,
such as batch or library, they may be
-incorporated in the ZINB model by using the **\\-\\-covars** flag and samples
+incorporated in the ZINB model by using the **\-\-covars** flag and samples
metadata file. For example, consider the following samples metadata
file, with a column describing the batch information of each
replicate.
@@ -1311,7 +1330,7 @@ replicate.
chol2 cholesterol /Users/example_data/cholesterol_rep3.wig B2
This information can be included to eliminate variability due to batch by using
-the **\\-\\-covars** flag.
+the **\-\-covars** flag.
::
@@ -1319,7 +1338,7 @@ the **\\-\\-covars** flag.
Similarly, an interaction variable may be included in the model.
-This is specified by the user with the **\\-\\-interactions** flag,
+This is specified by the user with the **\-\-interactions** flag,
followed by the name of a column in the samples metadata to test as the interaction
with the condition. If there are multiple interactions, they may be given as a comma-separated list.
@@ -1336,7 +1355,7 @@ differs depending on the strain, we could do this:
In this case, the condition is implicitly assumed to be the column in the samples metadata file
labeled 'Condition'. If you want to specify a different column to use as the primary condition to
-test (for example, if Treatment were a distinct column), you can use the **\\-\\-condition** flag:
+test (for example, if Treatment were a distinct column), you can use the **\-\-condition** flag:
::
@@ -1495,7 +1514,10 @@ typical threshold for conditional essentiality on is q-value < 0.05.
**LFCs** (log-fold-changes):
For each condition, the LFC is calculated as the log-base-2 of the
ratio of mean insertion count in that condition **relative to the
-mean of means across all the conditions**.
+mean of means across all the conditions** (by default).
+However, you can change this by designating a specific reference condition using the flag **\-\-ref**.
+(If there are multiple reference conditions, they may be given as a comma-separated list.)
+(If you are using interactions, specifying a reference condition by name is more complicated, because the name has to include the interactions, e.g. as shown in the column headers in the output file.)
Pseudocounts are incorporated to reduce the impact of noise on LFCs, based on the formula below.
The pseudocounts can be adjusted using the -PC flag.
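
As a rough illustration of the LFC description above, here is a minimal Python sketch. It is not TRANSIT's code: the function name `log_fold_changes` is hypothetical, the exact pseudocount formula is not shown in this excerpt (adding the pseudocount to both numerator and denominator is an assumption), and pooling several reference conditions by averaging their means is likewise an assumption.

```python
import math


def log_fold_changes(means, pseudocount=5.0, ref=None):
    """Return log2 fold-changes for one gene.

    means: dict mapping condition name -> mean insertion count.
    ref:   optional list of reference condition names (assumed to be
           pooled by averaging their means); if None, the baseline is
           the mean of means across all conditions.
    """
    if ref:
        baseline = sum(means[c] for c in ref) / len(ref)
    else:
        baseline = sum(means.values()) / len(means)
    # Assumed form: the pseudocount is added to both sides of the ratio.
    return {
        cond: math.log2((m + pseudocount) / (baseline + pseudocount))
        for cond, m in means.items()
    }


gene_means = {"A": 10.0, "B": 30.0, "C": 50.0}
print(log_fold_changes(gene_means))             # relative to the mean of means
print(log_fold_changes(gene_means, ref=["A"]))  # relative to condition A
```

Under these assumptions, the reference condition itself gets an LFC of 0, and the other conditions come out positive or negative depending on whether their mean count is above or below the reference.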
Changing the pseudocounts (via -PC) can reduce the artifactual appearance of genes with
@@ -1723,7 +1745,26 @@ Parameters
`Grossmann S, Bauer S, Robinson PN, Vingron M. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics. 2007 Nov 15;23(22):3024-31. <https://www.ncbi.nlm.nih.gov/pubmed/17848398>`_
+ For the ONT method in pathway_enrichment, the enrichment for a given
+ GO term can be expressed (in a simplified way, leaving out the
+ pseudocounts) as:
+
+::
+
+ enrichment = log ( (b/q) / (m/p) )
+|
+
+ where:
+
+* b is the number of genes with this GO term in the subset of hits (e.g. conditional essentials from resampling, with qval<0.05)
+* q is the number of genes in the subset of hits with a parent of this GO term
+* m is the total number of genes with this GO term in the genome
+* p is the number of genes in the genome with a parent of this GO term
+
+ So enrichment is the log of the ratio of 2 ratios:
+ 1. the relative abundance of genes with this GO term compared to those with a parent GO term among the hits
+ 2. the relative abundance of genes with this GO term compared to those with a parent GO term in the whole genome
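
The simplified enrichment formula above can be checked numerically with a short Python sketch. The function name `ont_enrichment` is hypothetical, pseudocounts are left out just as in the simplified formula, and the natural logarithm is an assumption because the text does not state the base.

```python
import math


def ont_enrichment(b, q, m, p):
    """Simplified enrichment = log((b/q) / (m/p)), without pseudocounts.

    b: genes with this GO term among the hits
    q: genes among the hits with a parent of this GO term
    m: genes with this GO term in the genome
    p: genes in the genome with a parent of this GO term
    """
    # Natural log is assumed; the log base is not given in the text.
    return math.log((b / q) / (m / p))


# Made-up counts: 10 of 50 hits versus 40 of 400 genes genome-wide.
print(ont_enrichment(b=10, q=50, m=40, p=400))
```

With these made-up counts the GO term is twice as frequent among the hits (10/50) as in the genome (40/400), so the enrichment is log 2, which is positive. A value below zero would indicate depletion.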
Auxiliary Pathway Files in Transit Data Directory
@@ -1919,7 +1960,7 @@ Usage:
::
- python3 src/transit.py heatmap <anova_or_zinb_output> <heatmap.png> -anova|-zinb [-topk <int>] [-qval <float>]
+ python3 src/transit.py heatmap <anova_or_zinb_output> <heatmap.png> -anova|-zinb [-topk <int>] [-qval <float>] [-low_mean_filter <int>]
Note that the first optional argument (flag) is required to be either '-anova' or '-zinb', a flag to
indicate the type of file being provided as the second argument.
@@ -1929,6 +1970,7 @@ However, the user may change the selection of genes through 2 flags:
* **-qval <float>**: change qval threshold for selecting genes (default=0.05)
* **-topk <int>**: select top k genes ranked by significance (qval)
+ * **-low_mean_filter <int>**: filter out genes with grand mean count (across all conditions) below this threshold (even if qval<0.05); default is to exclude genes with mean count<5
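
The gene selection described by these three flags could look roughly like the following Python sketch. This is not the code in heatmap.py: the function name `select_genes`, the row layout (dicts with `gene`, `qval` and `mean` keys), applying the mean filter before the other criteria, and letting `-topk` take precedence over `-qval` are all assumptions made for illustration.

```python
def select_genes(rows, qval=0.05, topk=None, low_mean_filter=5):
    """Pick the genes to show in the heatmap.

    rows: list of dicts with keys "gene", "qval", and "mean"
          (grand mean count across all conditions).
    """
    # Drop low-count genes first, even if they are significant.
    kept = [r for r in rows if r["mean"] >= low_mean_filter]
    if topk is not None:
        # Assumed: -topk ranks by q-value and overrides the threshold.
        return sorted(kept, key=lambda r: r["qval"])[:topk]
    return [r for r in kept if r["qval"] < qval]


rows = [
    {"gene": "g1", "qval": 0.001, "mean": 100.0},
    {"gene": "g2", "qval": 0.2, "mean": 80.0},
    {"gene": "g3", "qval": 0.01, "mean": 2.0},
    {"gene": "g4", "qval": 0.03, "mean": 20.0},
]
print([r["gene"] for r in select_genes(rows)])
print([r["gene"] for r in select_genes(rows, topk=1)])
```

In this toy input, `g3` is significant but has a grand mean below 5, so it is excluded under the default filter, as the description of `-low_mean_filter` suggests.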
Here is an example which generates the following image showing the similarities among
several different growth conditions:
@@ -1943,7 +1985,7 @@ several different growth conditions:
Importantly, the heatmap is based only on the subset of genes
-identified as significantly varying (Padj < 0.05, typically only a few
+identified as *significantly varying* (Padj < 0.05, typically only a few
hundred genes) in order to enhance the patterns, since otherwise they would
be washed out by the rest of the genes in the genome, the majority of
which usually do not exhibit significant variation in counts.
View it on GitLab: https://salsa.debian.org/med-team/tnseq-transit/-/compare/f9a54014a01047da490931368f80c851c54b4d8d...ee0f9a77e0696093ad5f76a2c2ef704fe2d0de8e