[med-svn] [Git][med-team/tnseq-transit][master] 5 commits: routine-update: New upstream version

Étienne Mollier (@emollier) gitlab at salsa.debian.org
Sun Nov 5 13:13:45 GMT 2023



Étienne Mollier pushed to branch master at Debian Med / tnseq-transit


Commits:
c317e28d by Étienne Mollier at 2023-11-05T13:57:06+01:00
routine-update: New upstream version

- - - - -
4f0b2f8d by Étienne Mollier at 2023-11-05T13:57:18+01:00
New upstream version 3.3.2
- - - - -
8fdba5f6 by Étienne Mollier at 2023-11-05T14:00:05+01:00
Update upstream source from tag 'upstream/3.3.2'

Update to upstream version '3.3.2'
with Debian dir a8cf88ee22e39df06c7d8518d9330e883414c834
- - - - -
63f29a94 by Étienne Mollier at 2023-11-05T14:00:08+01:00
routine-update: Build-Depends: s/dh-python/dh-sequence-python3/

- - - - -
0a8814e0 by Étienne Mollier at 2023-11-05T14:06:38+01:00
routine-update: Ready to upload to unstable

- - - - -


12 changed files:

- CHANGELOG.md
- debian/changelog
- debian/control
- debian/rules
- src/HMM_conf.py
- src/pytransit/__init__.py
- + src/pytransit/doc/source/_images/HMM_distributions.png
- src/pytransit/doc/source/index.rst
- src/pytransit/doc/source/method_HMM.rst
- src/pytransit/doc/source/method_normalization.rst
- + src/pytransit/doc/source/method_tnseq_stats.rst
- src/pytransit/doc/source/transit_features.rst → src/pytransit/doc/source/transit_quality_control.rst


Changes:

=====================================
CHANGELOG.md
=====================================
@@ -3,6 +3,15 @@ All notable changes to this project will be documented in this file.
 
 
 
+## Version 3.3.2 (2023-10-29)
+#### Transit:
+
+Major changes:
+ - added src/HMM_conf.py as a post-processing script for the HMM, to evaluate confidence of essentiality calls for each gene (see documentation on HMM)
+
+Minor changes:
+ - improvements to documentation related to Quality Control
+
 ## Version 3.3.1 (2023-10-18)
 #### Transit:
 


=====================================
debian/changelog
=====================================
@@ -1,3 +1,10 @@
+tnseq-transit (3.3.2-1) unstable; urgency=medium
+
+  * New upstream version
+  * Build-Depends: s/dh-python/dh-sequence-python3/ (routine-update)
+
+ -- Étienne Mollier <emollier at debian.org>  Sun, 05 Nov 2023 14:00:35 +0100
+
 tnseq-transit (3.3.1-1) unstable; urgency=medium
 
   * New upstream version 3.3.1


=====================================
debian/control
=====================================
@@ -5,7 +5,7 @@ Uploaders: Andreas Tille <tille at debian.org>,
 Section: science
 Priority: optional
 Build-Depends: debhelper-compat (= 13),
-               dh-python,
+               dh-sequence-python3,
                python3-dev,
                python3-setuptools,
                python3-numpy,


=====================================
debian/rules
=====================================
@@ -6,7 +6,7 @@ export LC_ALL=C.UTF-8
 include /usr/share/dpkg/default.mk
 
 %:
-	dh $@ --with python3 --buildsystem=pybuild
+	dh $@ --buildsystem=pybuild
 
 override_dh_install:
 	dh_install


=====================================
src/HMM_conf.py
=====================================
@@ -1,6 +1,9 @@
 import sys,numpy
 import scipy.stats
 
+# first pass...
+
+headers = []
 data = []
 all = []
 Sats = {}
@@ -9,7 +12,7 @@ TAs = {}
 Calls = {}
 for line in open(sys.argv[1]):
   line = line.strip()
-  if line[0]=='#': continue
+  if line[0]=='#': headers.append(line); continue
   w = line.split('\t')
   nTA = int(w[3])
   if nTA==0: continue
@@ -53,9 +56,14 @@ def calc_prob(sat,NZmean,meanSat,stdSat,meanNZmean,stdNZmean):
 
 # second pass...
 
+for line in headers[:-1]: print(line)
+newheaders = headers[-1].split('\t') # assume last comment line is column headers
+newheaders += "consis probES probGD probNE probGA conf flag".split()
+print('\t'.join(newheaders))
+
 for line in open(sys.argv[1]):
   line = line.strip()
-  if line[0]=='#': print (line); continue
+  if line[0]=='#': continue
   w = line.split('\t')
   id,Call = w[0],w[-1]
   nTA = int(w[3])
@@ -63,7 +71,6 @@ for line in open(sys.argv[1]):
   votes = [int(x) for x in w[4:8]]
   m = max(votes)
   consistency = m/float(nTA)
-  #prob = scipy.stats.binom.cdf(m,nTA,cons)
   probs = []
   STATES = "ES GD NE GA".split()
   for st in STATES:
@@ -78,7 +85,6 @@ for line in open(sys.argv[1]):
     if (Call=="GD" or Call=="NE") and  (relprobs[1]>0.25 and relprobs[2]>0.25) :flag="ambiguous"
     if (Call=="NE" or Call=="GA") and (relprobs[2]>0.25 and relprobs[3]>0.25):flag="ambiguous"
   
-  #vals = w+[nTA,m,round(consistency,3),round(prob,4)]
   vals = w+[round(consistency,3)]+[round(x,6) for x in relprobs]
   vals += [round(conf,4),flag]
   print ('\t'.join([str(x) for x in vals]))


=====================================
src/pytransit/__init__.py
=====================================
@@ -2,6 +2,6 @@
 __all__ = ["transit_tools", "tnseq_tools", "norm_tools", "stat_tools"]
 
 
-__version__ = "v3.3.1"
+__version__ = "v3.3.2"
 prefix = "[TRANSIT]"
 


=====================================
src/pytransit/doc/source/_images/HMM_distributions.png
=====================================
Binary files /dev/null and b/src/pytransit/doc/source/_images/HMM_distributions.png differ


=====================================
src/pytransit/doc/source/index.rst
=====================================
@@ -78,7 +78,7 @@ TRANSIT offers a variety of features including:
    transit_overview
    transit_install
    transit_running
-   transit_features
+   transit_quality_control
    file_formats
    
 .. toctree::
@@ -91,21 +91,21 @@ TRANSIT offers a variety of features including:
    :maxdepth: 3
    :caption: TnSeq ANALYSIS METHODS
 
+   method_normalization
+   method_tnseq_stats
    method_gumbel
    method_griffin
+   method_ttnfitness
    method_tn5gaps
    method_HMM
    method_resampling
    method_Utest
-   method_GI
    method_anova
-   method_zinb
-   method_normalization
-   method_pathway_enrichment
-   method_tnseq_stats
    method_corrplot
    method_heatmap
-   method_ttnfitness
+   method_GI
+   method_zinb
+   method_pathway_enrichment
 
 .. _tutorial-link:
 


=====================================
src/pytransit/doc/source/method_HMM.rst
=====================================
@@ -130,5 +130,156 @@ Run-time
 
 |
 
+HMM Confidence Scores (Oct 2023)
+---------------------
+
+One of the difficulties in assessing the certainties of the HMM calls
+is that, while the formal state probabilities are calculated at
+individual TA sites, the essentiality calls at the gene level are made
+by taking a vote (the most frequent state among its TA sites), and
+this does not lend itself to such formal certainty quantification.
+However, the reviewers’ question prompted us to devise a novel
+approach to evaluating the confidence of the HMM calls for genes.  In
+the output files, we include the local saturation and mean insertion
+count for each gene, along with the gene-level call and state
+distribution.  We have sometimes noticed that short genes are
+susceptible to being influenced by the essentiality of an adjacent
+region, which is evident by examining the insertion statistics.  For
+example, consider a hypothetical gene with just 2 TA sites that is
+labeled as ES by the HMM but has insertions at both sites.  It might
+be explained by proximity to a large essential gene or region, due to
+the “smoothing” the HMM does across the sequence of TA sites.  Thus we
+can sometimes recognize inaccurate calls by the HMM if the insertion
+statistics of a gene are not consistent with the call.
+
+In our paper on the HMM in Transit `(DeJesus et al, 2013)
+<https://pubmed.ncbi.nlm.nih.gov/24103077/>`_, 
+we showed a 2D plot of the posterior distribution of
+saturation and mean insertion counts (for a high-quality 
+TnSeq reference dataset) for the 4 essentiality states, which demonstrates that
+genes are nicely clustered, with ES genes having near-0 saturation and
+low counts at non-zero sites, NE genes have high saturation and
+counts, GD genes fall in between, and GA genes are almost fully
+saturated with excessively high counts.
+
+.. image:: _images/HMM_distributions.png
+   :width: 400
+   :align: center
+
+(Figure 5 from `(DeJesus et al, 2013) <https://pubmed.ncbi.nlm.nih.gov/24103077/>`_ ) 
+
+We now calculate the distributions of insertion statistics **for each
+dataset** on which the HMM is run, 
+and use it to assess the confidence in each of the essentiality calls.
+We start by calculating the mean and standard
+deviation of saturation and insertion count over all the genes in each
+of the 4 states (ES, GD, NE, and GA).  Then, for each gene, we compute
+the probability density of its local statistics with respect to the
+product of Normal distributions for its called state.  For example,
+suppose a gene g is called state s.  Then:
+
+::
+
+     conf(g|s) = N(sat(g)|μ_sat(s), σ_sat(s)) * N(cnt(g)|μ_cnt(s), σ_cnt(s))
+
+This models the joint probability of the distribution of saturation
+and insertion counts for each class (simplistically) as independent,
+effectively forming “rectangular” regions around each cluster.  We can
+then calculate the probability that the gene belongs to each state
+based on these posterior distributions, and normalize them to sum
+to 1.  Finally, the “confidence” of the HMM call for each gene is the
+normalized probability of the gene’s insertion statistics given the
+posterior distribution for that essentiality state.
+
+This confidence score nicely identifies genes of low confidence, where
+the local saturation and mean insertion count seem inconsistent with
+the HMM call.  The low-confidence genes are biased toward short genes
+(with 1-3 TA sites), though they include some large genes with many TA
+sites as well.  Some of the former are cases where the call of a short
+gene is influenced by an adjacent region.  Some of the latter include
+ambiguous cases like multi-domain proteins, where one domain is
+essential and the other is not.  We observed that there are often
+“borderline” (or ambiguous) genes, which the HMM labels as one state,
+but which are more consistent with a similar state.  For example, in
+some cases, a low-confidence ES gene might have a higher posterior
+probability of GD, based on insertion statistics.  Other cases include
+ambiguities between NE and GD, or NE and GA, especially for genes with
+mean insertion counts on the border between the clusters (posterior
+distributions) in the 2D plot.  After identifying and removing these
+borderline genes (also now indicated in the HMM output files), we find
+that most of the low-confidence genes have confidence scores in the
+approximate range of 0 to 0.25.  The number of low-confidence genes
+can range from a few hundred up to a thousand
+(for which the predicted essentiality call should be ignored), 
+depending on the
+saturation of the input dataset (.wig file); less-saturated datasets
+tend to have more low-confidence genes.
+
+We implemented this procedure as a stand-alone **post-processing script**
+(called '**HMM_conf.py**' in the src/ directory) which is run on HMM output files,
+calculates a “confidence” score for each gene
+and appends this information as extra columns to the HMM output file.
+(In the future, it will be integrated directly into the HMM output,
+obviating the need for a second step.)
+
+::
+
+  usage: python3 src/HMM_conf.py <HMM_output.genes.txt>
+
+  example: 
+
+  > python HMM_conf.py HMM_Ref_sync3_genes.txt > HMM_Ref_sync3_genes.conf.txt
+
+The script adds the following columns:
+
+ * **consis** - consistency (proportion of TA sites representing the majority essentiality state)
+ * **probES** - conditional probability (unnormalized) of the insertions statistics (sat and mean) if the gene were essential 
+ * **probGD** - conditional probability (unnormalized) of the insertions statistics (sat and mean) if the gene were growth-defect
+ * **probNE** - conditional probability (unnormalized) of the insertions statistics (sat and mean) if the gene were non-essential
+ * **probGA** - conditional probability (unnormalized) of the insertions statistics (sat and mean) if the gene were growth-advantaged
+ * **conf** - confidence score (normalized conditional joint probability of the insertion statistics, given the actual essential call made by the HMM)
+ * **flag** - genes that are ambiguous or have low confidence are labeled as such
+  * *low-confidence* means: the HMM call is inconsistent with statistics of insertion counts in gene, so the HMM call should be ignored
+  * *ambiguous* means: the HMM gives high probability to 2 adjacent essentiality states, like ES-or-GD, or GD-or-NE; these could be borderline cases where the gene could be in either category
+
+If consistency<1.0, it means not all the TA sites in the gene agree with the essentiality call, which is made by majority vote. 
+It is OK if a small fraction of TA sites in a gene are labeled otherwise.
+If it is a large fraction (consistency close to 50%), it might be a 'domain-essential'
+(multi-domain gene where one domain is ES and the other is NE).
+
+Here is an example to show what the additional columns look like.
+Note that MAB_0005 was called NE, but only has insertions at 1 out of 4 TA sites
+(sat=25%) with a mean count of only 7, so it is more consistent with ES; hence it is
+flagged as low-confidence (and one should ignore the NE call).
+
+Note!!! *The extra column headers still need to be added to HMM_conf.py: cons
+probES probGD probNE probGA conf flag.*
+
+::
+
+  # avg gene-level consistency of HMM states: 0.9894
+  # state posterior probability distributions:
+  #  ES: genes=364, meanSat=0.038, stdSat=0.075, meanNZmean=12.0, stdNZmean=61.1
+  #  GD: genes=146, meanSat=0.297, stdSat=0.227, meanNZmean=9.6, stdNZmean=16.0
+  #  NE: genes=3971, meanSat=0.816, stdSat=0.169, meanNZmean=147.4, stdNZmean=110.7
+  #  GA: genes=419, meanSat=0.923, stdSat=0.11, meanNZmean=393.3, stdNZmean=210.0
+  #HMM - Genes
+  #command line: python3 ../../transit/src/transit.py hmm TnSeq-Ref-1.wig,TnSeq-Ref-2.wig,TnSeq-Ref-3.wig abscessus.prot_table HMM_Ref_sync3.txt -iN 5 -iC 5
+  #summary of gene calls: ES=364, GD=146, NE=3971, GA=419, N/A=23
+  #key: ES=essential, GD=insertions cause growth-defect, NE=non-essential, GA=insertions confer growth-advantage, N/A=not analyzed (genes with 0 TA sites)
+  #ORF     gene annotation                                TAs ES sites GD sites NE sites GA sites saturation   mean call consis probES  probGD   probNE   probGA   conf   flag
+  MAB_0001 dnaA Chromosomal replication initiator protein 24  24       0         0        0       0.0417       1.00 ES   1.0  0.22865 0.025725 0.000515 0.000169 0.8965
+  MAB_0002 dnaN DNA polymerase III, beta subunit (DnaN)   13  13       0         0        0       0.0769       1.00 ES   1.0  0.13582 0.028284 0.000536 0.000175 0.8241
+  MAB_0003 gnD  6-phosphogluconate dehydrogenase Gnd      19   0       0        19        0       0.9474      81.33 NE   1.0  0.00000 0.000000 0.001181 0.000329 0.7870
+  MAB_0004 recF DNA replication and repair protein RecF   15   0       0        15        0       0.6667      80.80 NE   1.0  0.00000 0.000000 0.001175 0.000307 0.7926
+  MAB_0005 -    hypothetical protein                       4   0       0         4        0       0.2500       7.00 NE   1.0  0.00000 0.052913 0.000661 0.000207 0.0123 low-confidence
+  MAB_0006 -    DNA gyrase (subunit B) GyrB               30  30       0         0        0       0.0000       0.00 ES   1.0  0.12864 0.020461 0.000487 0.000161 0.8590
+  MAB_0007 -    Hypothetical protein                      24   0       0         0       24       0.9167     456.64 GA   1.0  0.00000 0.000000 0.000145 0.000433 0.7487
+  MAB_0008 -    Hypothetical protein                       3   0       0         0        3       0.6667     173.00 GA   1.0  0.00000 0.000000 0.780528 0.219472 0.2195 low-confidence
+  MAB_0009 -    Hypothetical protein                       6   0       0         0        6       1.0000     495.83 GA   1.0  0.00000 0.000000 0.157296 0.842704 0.8427 
+  MAB_0010c -   Hypothetical protein                      14   0       0         0       14       0.9286     390.08 GA   1.0  0.00000 0.000000 0.435214 0.564786 0.5648 ambiguous
+  ...
+
+
 .. rst-class:: transit_sectionend
 ----


=====================================
src/pytransit/doc/source/method_normalization.rst
=====================================
@@ -13,9 +13,9 @@ briefly described below:
 - **TTR:**
     Trimmed Total Reads (TTR), normalized by the total
     read-counts (like totreads), but trims top and bottom 5% of
-    read-counts. **This is the recommended normalization method for most cases**
-    as it has the beneffit of normalizing for difference in
-    saturation in the context of resampling.
+    read-counts. **This is the recommended normalization method for most cases**,
+    as it has the benefit of compensating for differences in
+    saturation (which is especially important for resampling).
 
 - **nzmean:**
     Normalizes datasets to have the same mean over the
@@ -41,16 +41,52 @@ briefly described below:
     distribution is estimated as the mean across the sorted datasets
     at each site, and then the original (unsorted) datasets are
     assigned values from the empirical distribution based on their
-    quantiles.
+    quantiles.  This actually doesn't work well on TnSeq data if a large fraction
+    of TA sites have counts of 0 (ties).
 
 - **betageom:**
     Normalizes the datasets to fit an "ideal" Geometric
     distribution with a variable probability parameter *p*. Specially
-    useful for datasets that contain a large skew. See :ref:`BGC` .
+    useful for datasets that contain a large skew. 
 
 - **nonorm:**
     No normalization is performed.
 
+
+
+.. _BGC:
+
+Beta-Geometric Correction (BGC)
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you have a "bad" or poorly-behaving or "skewed" dataset (e.g. with mostly low
+counts, dominated by a few high counts), right now the only remedy you
+can try is applying the **Beta-Geometric correction (BGC)**, which is a
+non-linear adjustment to the insertion counts in a wig file to make
+them more like an ideal Geometric distribution 
+(`DeJesus & Ioerger, 2016 <https://www.ncbi.nlm.nih.gov/pubmed/26932272>`_). 
+
+Note, BGC is the only non-linear normalization available in Transit.  All the other
+normalization methods, like TTR, are linear adjustments (scaling) of counts,
+and so they can't correct for skewing.
+
+BGC normalization helps 'de-skew' a dataset by bring the highest insertion
+counts down into the range of a few thousands.  This sometimes improves
+statistical analyses like resampling, etc by reducing noise.
+
+In the GUI, when you are looking at wig files loaded, you can change
+the normalization (e.g. from TTR to betageom) using the drop-down.  Be
+aware that the Beta-Geometric normalization is *computationally-intensive* and
+**might take few minutes per sample**.
+
+If it looks like it might help (i.e. if the QQ-plot fits the diagonal better using BG
+normalization),
+you can create BG-corrected versions of individual wig files by
+exporting them using the *normalize* command
+on the command-line with '-n betageom' to specify normalization.
+
+
+
 Command-line
 ------------
 
@@ -72,7 +108,7 @@ Example
 
   > python3 src/transit.py normalize Rv_1_H37RvRef.wig Rv_1_H37RvRef_BG.wig -n betageom
 
-The normalize command now also works on combined_wig_ files too.
+The normalize command now also works on :ref:`combined_wig <combined_wig>` files too.
 If the input file is a combined_wig file, indicate it with a '-c' flag.
 
 


=====================================
src/pytransit/doc/source/method_tnseq_stats.rst
=====================================
@@ -0,0 +1,76 @@
+.. _tnseq_stats:
+
+tnseq_stats
+========
+
+This command is useful for calculating some metrics on input wig (or
+combined_wig files) for assessement of data quality.
+
+Similar information can be generated through the GUI via
+the menu options 'View->Quality Control' (select samples in main window first).
+
+Usage:
+------
+
+::
+
+  > usage: python3 src/transit.py tnseq_stats <file.wig>+ [-o <output_file>]
+           python3 src/transit.py tnseq_stats -c <combined_wig> [-o <output_file>]
+
+
+It generates a table (tab-separated text file that can be opened in Excel) with the following statistics in it:
+
+=============  ==============================================  =============================================================================================================
+Column Header  Column Definition                                 Comments
+=============  ==============================================  =============================================================================================================
+dataset        Name of sample (wig file)
+density        Fraction of sites with insertions.              "Well-saturated" Himar1 datasets have >30% saturation. Below this, statistical methods may have trouble.
+mean_ct        Average read-count over all TA sites.
+NZMean         Average read-count, excluding empty sites.       A value between 30-200 is usually good for Himar1 datasets. Too high or too low can indicate problems.
+NZMedian       Median read-count, excluding empty sites.        As read-counts can often have spikes, median serves as a good robust estimate.
+max_ct         Largest read-count at any TA site                Useful to determine whether there are outliers/spikes, which may indicate sequencing issues.
+total_cts      Sum of total read-counts in the sample.          Indicates how much sequencing material was obtained. Typically >1M reads is desired for Himar1 datasets.
+skewness       3rd-order moment of read-count distribution.     Sharp peak? Large skew may indicate issues with a dataset. Typically a skew < 50 is desired. May be higher when
+kurtosis       4th-order moment of read-counts distribution     Lop-sided peak?
+PTI            Pickand Tail Index                               Another statistical measure that indicates presence of individual TA sites with high outlier counts. PTI>1.0 is not good.
+=============  ==============================================  =============================================================================================================
+
+
+
+Here is an example:
+
+::
+
+  > python3 src/transit.py tnseq_stats -c src/pytransit/data/cholesterol_glycerol_combined.dat
+  dataset density mean_ct NZmean  NZmedian        max_ct  total_cts       skewness        kurtosis        pickands_tail_index
+  src/pytransit/data/cholesterol_H37Rv_rep1.wig   0.439   139.6   317.6   147     125355.5        10414005        54.8    4237.7  0.973
+  src/pytransit/data/cholesterol_H37Rv_rep2.wig   0.439   171.4   390.5   148     704662.8        12786637        105.8   14216.2 1.529
+  src/pytransit/data/cholesterol_H37Rv_rep3.wig   0.359   173.8   484.2   171     292294.8        12968502        42.2    2328.0  1.584
+  src/pytransit/data/glycerol_H37Rv_rep1.wig      0.419   123.3   294.5   160     8813.3  9195672 4.0     33.0    0.184
+  src/pytransit/data/glycerol_H37Rv_rep2.wig      0.516   123.8   240.1   127     8542.5  9235984 4.0     33.5    0.152
+
+
+In this example, you can see the 5 samples have saturations in the
+range of 35.9-51.6% (which is decent).  The NZMeans are in the range
+123-139, but this is *post-normalization*.  (TTR normalization had
+already been applied to this combined_wig file, so the means can be
+expected to be scaled to around 100.)  If you want to see the NZmeans
+for the *raw data*, re-generate the combined_wig file using '-n
+nonorm', to skip the automatic normalization step.  These samples also
+exhibit *skewness* that is on the high side (33.0-105.8).  This is
+probably related to the fact that some individual TA sites have very
+high counts.  For example, rep2 of cholesterol has a max count of
+704662 at a single TA site, representing over 5% of the 12.78M total
+insertion counts.  TTR is supposed to be robust by ignoring the top 5%
+of most abundant sites during normalization, but still the rest of the
+distribution of counts could be skewed.  This sample also has a high
+*Pickands' tail index* of 1.53 (which is above 1.0), also suggesting
+skew.  While, we currently don't have recommendations for hard cutoffs
+to use for identifying bad samples (e.g. that might need to be
+re-sequenced), I would say that skew>30 and/or PTI>1.0 are signs that
+a sample might be noisy or lower-quality.  See :ref:`Quality Control
+<transit_quality_control>` for more discussion about assessing quality
+of TnSeq datasets.  Nonetheless, doing resampling on this data still
+yielded insights into many genes required for cholesterol metabolism
+in *M. tuberculosis* `(Griffin et al, 2009)
+<https://pubmed.ncbi.nlm.nih.gov/21980284/>`_. 


=====================================
src/pytransit/doc/source/transit_features.rst → src/pytransit/doc/source/transit_quality_control.rst
=====================================
@@ -31,7 +31,11 @@ QC Metrics Table
 
 The Quality Control window contains a table of the datasets and metrics, similar
 to the one in the main TRANSIT interface. This table has an extended set of
-metrics to provide a better picture of the quality of the datasets:
+metrics to provide a better picture of the quality of the datasets (below).
+
+A similar table of dataset statistics can be generated at the command-line
+using the Transit :ref:`tnseq_stats <tnseq_stats>` command.
+
 
 
 =============  ==============================================  =============================================================================================================
@@ -50,6 +54,34 @@ Kurtosis       Kurtosis of the read-counts in the dataset.
 =============  ==============================================  =============================================================================================================
 
 
+Interpretation of Data Quality (Rules of Thumb)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. NOTE:: 
+
+ While there are no
+ rigid criteria for defining "bad" datasets, rules of thumb I use
+ for "good" datasets are: **density>30% (ideally >50%)** and **NZmean>10 (ideally >50)**.  
+ In addition, I look
+ at MaxReadCount and Skewness as indicators.  Typically, MaxReadCount
+ will be in the range of a few thousand to tens-of-thousands.  
+ If you see individual sites with
+ counts in the range of 10\ :sup:`5`\ -10\ :sup:`6` , it might mean you have some positive
+ selection at a site (e.g. biological, or due to PCR jackpotting), and
+ this can have the effect of reducing counts and influencing the
+ distribution at all the other sites.  If MaxReadCount<100, that could also
+ be problematic (either not enough reads, or possibly skewing).
+ Also, **skewness>30** often (but not
+ always) signals a problem.  Kurtosis doesn't seem to be very
+ meaningful.  The reason it is not easy to boil all these down to a
+ simple set of criteria is that some some of these metrics are interdepenent.
+
+ 
+If you have a "bad" or poorly-behaving or "skewed" dataset (e.g. with mostly low
+counts, dominated by a few high counts), right now the only remedy you
+can try is applying the :ref:`Beta-Geometric Correction (BGC) <BGC>`, which is 
+a non-linear normalization procedure for adjusting insertion counts.
+
 
 QC Plots
 ~~~~~~~~
@@ -64,7 +96,7 @@ Figure 1: Read-Count Distribution
 
 
 .. image:: _images/transit_quality_control_histogram.png
-   :width: 600
+   :width: 300
    :align: center
 
 
@@ -77,7 +109,7 @@ Figure 2: QQ-Plot of Read-Counts vs Geometric Distribution
 
 
 .. image:: _images/transit_quality_control_qqplot.png
-   :width: 600
+   :width: 300
    :align: center
 
 
@@ -94,7 +126,7 @@ Figure 3: Ranked plot of Read-Counts
 
 
 .. image:: _images/transit_quality_control_ranked.png
-   :width: 600
+   :width: 300
    :align: center
 
 
@@ -102,66 +134,3 @@ Figure 3: Ranked plot of Read-Counts
 The second plot in the Quality Control window is a plot of the read-counts in sorted order. This may be helpful in indentifying outliers that may exist in the dataset. Typically, some large counts are expected and some normalization methods, like TTR, are robust to such outliers. However, too many outliers, or one single outlier that is overhwelmingly different than the rest may indicate an issue like PCR amplification (especially in libraries constructed older protocols).
 
 
-Interpretation of Data Quality
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-It is important to be able to evaluate the quality of datasets.
-In a nutshell, we look at statistics like saturation, and mean read count,
-but also things like max count and skewness.
-
-There are two ways to do QC in Transit - via the GUI and command-line.  
-In the GUI, one can load a set of
-wig files a select "View->Quality Control" in the menu; this will
-display some plots of read-count distribution.  Ideally, you want most of
-your datasets to fall along the diagonal on a QQ-plot.  Real data will
-often deviate somewhat (I will try to be more quantitative about this in the future),
-but if a dataset skews far off from the diagonal, it could cause problems
-with analytical methods like resampling or the HMM.  
-
-.. image:: http://saclab.tamu.edu/essentiality/transit/QC_example.png
-
-You can also generate the same table to statistics as on the QC panel
-from the command-line using the :ref:`tnseq_stats <tnseq_stats>` command.  
-
-Below the plots are a table of statistics.  While there are not
-rigorous criteria for defining "bad" datasets, rules of thumb I use
-for "good" datasets are: density>30% (ideally >50%) and NZmean>10 (ideally >50).  
-In addition, I look
-at MaxReadCount and Skewness as indicators.  Typically, MaxReadCount
-will be in the range of a few thousand to tens-of-thousands.  
-If you see individual sites with
-counts in the range of 10\ :sup:`5`\ -10\ :sup:`6` , it might mean you have some positive
-selection at a site (e.g. biological, or due to PCR jackpotting), and
-this can have the effect of reducing counts and influencing the
-distribution at all the other sites.  If MaxReadCount<100, that is also
-probably problematic (either not enough reads, or possibly skewing).
-Also, skewness>30 often (but not
-always) signals a problem.  Kurtosis doesn't seem to be very
-meaningful.  The reason it is not easy to boil all these down to a
-simple set of criteria is that some some of the metrics interact with
-each other.  
-
-
-.. _BGC:
-
-Beta-Geometric Correction
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you have a "bad" or poorly-behaving or "skewed" dataset (e.g. with mostly low
-counts, dominated by a few high counts), right now the only remedy you
-can try is applying the **Beta-Geometric correction (BGC)**, which is a
-non-linear adjustment to the insertion counts in a wig file to make
-them more like an ideal Geometric distribution (`DeJesus & Ioerger, 2016 <https://www.ncbi.nlm.nih.gov/pubmed/26932272>`_). (Note, all the
-other normalizations, like TTR, are linear adjustments, and so they
-can't correct for skewing.)
-
-In the GUI, when you are looking, you can change
-the normalization (e.g. from TTR to betageom) using the drop-down.  Be aware that the Beta-Geometric
-normalization is compute-intensive and might take few minutes.
-
-If it looks like it might help (i.e. if the QQ-plot fits the diagonal better using BG
-normalization),
-you can created BG-corrected versions of individual wig files by
-exporting them using the :ref:`normalize command <normalization>` 
-on the command-line with '-n betageom' to specify normalization.
-



View it on GitLab: https://salsa.debian.org/med-team/tnseq-transit/-/compare/2e42512a7f9c064d3c98422b3137515abb87851a...0a8814e0e1c10ad2f6305abd24909c77262d8df0

-- 
View it on GitLab: https://salsa.debian.org/med-team/tnseq-transit/-/compare/2e42512a7f9c064d3c98422b3137515abb87851a...0a8814e0e1c10ad2f6305abd24909c77262d8df0
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20231105/8e3d68de/attachment-0001.htm>


More information about the debian-med-commit mailing list