[med-svn] [Git][med-team/last-align][upstream] New upstream version 1456

Mon Jun 19 14:59:07 BST 2023


Nilesh Patra pushed to branch upstream at Debian Med / last-align


Commits:
455804a1 by Nilesh Patra at 2023-06-19T19:16:30+05:30
New upstream version 1456
- - - - -


7 changed files:

- bin/last-train
- bin/parallel-fasta
- bin/parallel-fastq
- doc/last-cookbook.rst
- doc/last-train.rst
- src/makefile
- test/last-train-test.out


Changes:

=====================================
bin/last-train
=====================================
@@ -854,9 +854,15 @@ def doTraining(opts, args):
     ss = scoresAndScaleFunc(outerScale, matParams, delRatios, insRatios)
     matScores, delCosts, insCosts, scale, rowFreqs, colFreqs = ss
     if not opts.codon:
+        rowSum = sum(rowFreqs)
+        colSum = sum(colFreqs)
         pid = sum(math.exp(matScores[i][i] / scale) * rowFreqs[i] * colFreqs[i]
-                  for i in range(len(matScores))) / sum(colFreqs)
+                  for i in range(len(matScores))) / colSum
+        rowProbs = [i / rowSum for i in rowFreqs]
+        colProbs = [i / colSum for i in colFreqs]
         print("# substitution percent identity: {0:.6}".format(100 * pid))
+        print("# ref letter %:", *(format(100 * i, "#.3") for i in rowProbs))
+        print("# qry letter %:", *(format(100 * i, "#.3") for i in colProbs))
     if opts.X: print("#last -X", opts.X)
     if opts.R: print("#last -R", opts.R)
     if opts.Q: print("#last -Q", opts.Q)
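
As a rough illustration of the new "# ref letter %:" and "# qry letter %:" output lines, the sketch below normalizes a list of letter frequencies and formats them with the same format spec as the patched code (`format(100 * p, "#.3")`). The function name `letter_percentages` and the input frequencies are made up for this example; they are not part of last-train.

```python
def letter_percentages(freqs):
    """Normalize frequencies so they sum to 1, then format each one as a
    percentage with 3 significant digits ("#" keeps trailing zeros)."""
    total = sum(freqs)
    probs = [f / total for f in freqs]
    return [format(100 * p, "#.3") for p in probs]


# Hypothetical, unnormalized frequencies for the letters A, C, G, T.
example_freqs = [29, 21, 21, 29]
print("# ref letter %:", *letter_percentages(example_freqs))
```

Dividing by the sum first means the printed percentages add up to about 100 even when the raw frequencies are not normalized, which appears to be the reason the patch introduces `rowSum` and `colSum`.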


=====================================
bin/parallel-fasta
=====================================
@@ -1,3 +1,3 @@
 #! /bin/sh
 
-exec parallel --round --recstart '>' "$@"
+exec parallel --pipe --round --recstart '>' "$@"
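
To show what splitting input at a record start of '>' means for FASTA data, here is a simplified sketch in Python. It is an assumption-laden model, not a description of GNU parallel's internals: it deals out single records in turn, whereas parallel with --pipe works on larger blocks of input. The helper names `split_fasta_records` and `round_robin` are hypothetical.

```python
def split_fasta_records(text):
    """Split FASTA text into records, each starting at a line beginning with '>'."""
    records = []
    current = []
    for line in text.splitlines(keepends=True):
        if line.startswith(">") and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records


def round_robin(records, n_jobs):
    """Deal records out to n_jobs buckets in turn (simplified model)."""
    buckets = [[] for _ in range(n_jobs)]
    for i, record in enumerate(records):
        buckets[i % n_jobs].append(record)
    return buckets


fasta = ">a\nAC\nGT\n>b\nTT\n>c\nGG\n"
for job, bucket in enumerate(round_robin(split_fasta_records(fasta), 2)):
    print(job, [record.split("\n", 1)[0] for record in bucket])
```

The point of splitting only at '>' is that a multi-line sequence is never cut in the middle, so every job receives whole records.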


=====================================
bin/parallel-fastq
=====================================
@@ -1,3 +1,3 @@
 #! /bin/sh
 
-exec parallel --round -L8 "$@"
+exec parallel --pipe --round -L8 "$@"


=====================================
doc/last-cookbook.rst
=====================================
@@ -313,6 +313,11 @@ use, add lastdb seeding_ option ``-uMAM4`` or ``-uMAM8``.  To
 increase them even more, add lastal_ option ``-m100`` (or as high as
 you can bear).
 
+Aligning distantly-related genomes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+See https://github.com/mcfrith/last-genome-alignments
+
 Large reference sequences
 -------------------------
 


=====================================
doc/last-train.rst
=====================================
@@ -3,25 +3,41 @@ last-train
 
 last-train finds the rates (probabilities) of insertion, deletion, and
 substitutions between two sets of sequences.  It thereby finds
-suitable substitution and gap scores for aligning them.
+suitable substitution and gap scores for aligning them.  You can use
+it like this::
 
-It (probabilistically) aligns the sequences using some initial score
-parameters, then estimates better score parameters based on the
-alignments, and repeats this procedure until the parameters stop
-changing.
+  lastdb mydb reference.fasta
+  last-train mydb queries.fasta > my.train
 
-The usage is like this::
+last-train can read .gz files, or from pipes::
 
-  lastdb mydb reference.fasta
-  last-train mydb queries.fasta
+  bzcat queries.fasta.bz2 | last-train mydb > my.train
 
-last-train prints a summary of each alignment step, followed by the
-final score parameters, in a format that can be read by `lastal's -p
-option <doc/lastal.rst>`_.
+How it works
+------------
 
-last-train can read .gz files, or from pipes::
+1. For sake of speed, last-train just uses some pseudo-random chunks
+   of ``queries.fasta``.
+
+2. It starts with an initial guess for substitution and gap
+   parameters.
+
+3. Using these parameters, it finds similar segments between the
+   chunks and ``reference.fasta``.
+
+   If one part of the chunks matches several parts of
+   ``reference.fasta``, only the best matches are kept.
+
+4. It gets substitution and gap parameters from these similar
+   segments.
+
+5. It uses these parameters to find similar segments more accurately,
+   then gets parameters again, and repeats until the result stops
+   changing.
 
-  bzcat queries.fasta.bz2 | last-train mydb
+last-train prints a summary of each iteration, followed by the final
+score parameters in a format that can be read by `lastal's -p option
+<doc/lastal.rst>`_.
 
 Options
 -------
@@ -44,9 +60,10 @@ Training options
 --gapsym
        Force the insertion costs to equal the deletion costs.
 --pid=PID
-       Ignore alignments with > PID% identity (matches / [matches +
-       mismatches]).  This aims to optimize the parameters for
-       low-similarity alignments (similarly to the BLOSUM matrices).
+       Ignore similar segments with > PID% identity (matches /
+       [matches + mismatches]).  This aims to optimize the parameters
+       for low-similarity alignments (similarly to the BLOSUM
+       matrices).
 --postmask=NUMBER
        By default, last-train ignores alignments of mostly-lowercase
        sequence (by using `last-postmask <doc/last-postmask.rst>`_).
@@ -185,6 +202,7 @@ Bugs
 
 * last-train can fail for various reasons, e.g. if the sequences are
   too dissimilar.  If it fails to find any alignments, you could try
-  reducing the alignment significance_ threshold with option ``-D``.
+  increasing the sample number, or reducing the alignment
+  significance_ threshold with option ``-D``.
 
 .. _significance: doc/last-evalues.rst
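
The "How it works" steps added to doc/last-train.rst describe a repeat-until-stable loop. A minimal sketch of that control flow follows, assuming two caller-supplied functions: `find_segments` (steps 3 and 5: find similar segments with the current parameters) and `estimate_params` (step 4: derive parameters from those segments). Both names, the `max_iterations` cap and the toy numbers are hypothetical; the real last-train logic is considerably more involved.

```python
def train(initial_params, find_segments, estimate_params, max_iterations=100):
    """Iterate: find segments with the current parameters, re-estimate the
    parameters from them, and stop once the parameters no longer change."""
    params = initial_params
    for iteration in range(1, max_iterations + 1):
        segments = find_segments(params)
        new_params = estimate_params(segments)
        if new_params == params:
            return params, iteration
        params = new_params
    return params, max_iterations


# Toy stand-ins: the "parameters" are a single integer that drifts toward 10.
final_params, iterations = train(
    0,
    find_segments=lambda p: p,
    estimate_params=lambda s: (s + 10) // 2,
)
print(final_params, iterations)
```

The iteration cap is a safety net added for this sketch, so that a non-converging toy input cannot loop forever.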


=====================================
src/makefile
=====================================
@@ -143,7 +143,7 @@ ScoreMatrixData.hh: ../data/*.mat
 	../build/mat-inc.sh ../data/*.mat > $@
 
 VERSION1 = git describe --dirty
-VERSION2 = echo ' (HEAD -> main, tag: 1454) ' | sed -e 's/.*tag: *//' -e 's/[,) ].*//'
+VERSION2 = echo ' (HEAD -> main, tag: 1456) ' | sed -e 's/.*tag: *//' -e 's/[,) ].*//'
 
 VERSION = \"`test -e ../.git && $(VERSION1) || $(VERSION2)`\"
 

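The VERSION2 fallback pulls the tag out of a ref-names string with two sed substitutions. The sketch below reproduces those two substitutions with Python's `re` module to show what they yield for the string in this commit; the function name `extract_tag` is made up for the example.

```python
import re


def extract_tag(ref_names):
    """Mimic: sed -e 's/.*tag: *//' -e 's/[,) ].*//'"""
    # Drop everything up to and including "tag:" plus any following spaces.
    after_tag = re.sub(r".*tag: *", "", ref_names, count=1)
    # Drop everything from the first comma, closing parenthesis or space onward.
    return re.sub(r"[,) ].*", "", after_tag, count=1)


print(extract_tag(" (HEAD -> main, tag: 1456) "))
```

For the string in the new makefile line this yields `1456`, matching the upstream version named in the commit message.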

=====================================
test/last-train-test.out
=====================================
@@ -391,6 +391,8 @@ TEST last-train -m1 /tmp/last-train-test < ../examples/mouseMito.fa
 # T   -112    -68   -215     89
 
 # substitution percent identity: 72.6161
+# ref letter %: 28.4 30.3 13.4 27.9
+# qry letter %: 30.8 25.2 12.5 31.5
 #last -t4.4363
 #last -a 26
 #last -A 24
@@ -840,6 +842,8 @@ TEST last-train -m1 -C2 --revsym /tmp/last-train-test ../examples/mouseMito.fa
 # T   -111    -77   -126     82
 
 # substitution percent identity: 71.8801
+# ref letter %: 29.0 21.0 21.0 29.0
+# qry letter %: 33.1 16.9 16.9 33.1
 #last -t4.6475
 #last -a 26
 #last -A 23
@@ -1246,6 +1250,8 @@ TEST last-train -m1 -k16 --matsym --gapsym /tmp/last-train-test ../examples/mous
 # T   -111    -44   -186     89
 
 # substitution percent identity: 73.3262
+# ref letter %: 30.6 27.0 13.0 29.4
+# qry letter %: 30.6 27.0 13.0 29.4
 #last -t4.41768
 #last -a 26
 #last -A 26
@@ -1738,6 +1744,8 @@ TEST last-train -Q1 /tmp/last-train-test bs100.fastq
 # T   -497   -228   -437     61
 
 # substitution percent identity: 76.2866
+# ref letter %: 31.0 23.6 19.4 26.0
+# qry letter %: 31.0 0.0154 19.4 49.5
 #last -Q 1
 #last -t4.28179
 #last -a 39



View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/455804a17a1dd7365cb9ed734812479a331cb71d
