[med-svn] [Git][med-team/minimap2][upstream] New upstream version 2.26+dfsg
Michael R. Crusoe (@crusoe)
gitlab at salsa.debian.org
Thu Feb 29 15:02:50 GMT 2024
Michael R. Crusoe pushed to branch upstream at Debian Med / minimap2
Commits:
1f95290f by Michael R. Crusoe at 2024-02-29T15:44:03+01:00
New upstream version 2.26+dfsg
- - - - -
24 changed files:
- − .travis.yml
- Makefile
- NEWS.md
- README.md
- align.c
- cookbook.md
- format.c
- kalloc.c
- kalloc.h
- ksw2.h
- ksw2_extd2_sse.c
- ksw2_exts2_sse.c
- ksw2_extz2_sse.c
- lchain.c
- main.c
- map.c
- minimap.h
- minimap2.1
- misc/paftools.js
- options.c
- + pyproject.toml
- python/mappy.pyx
- seed.c
- setup.py
Changes:
=====================================
.travis.yml deleted
=====================================
@@ -1,24 +0,0 @@
-matrix:
- include:
- - language: c
- compiler: gcc
- script: make
- - language: c
- compiler: clang
- script: make
- - arch: arm64
- language: c
- compiler: gcc
- script: make arm_neon=1 aarch64=1
- - language: python
- python: "2.7"
- before_install: pip install cython
- script: python setup.py build_ext
- - language: python
- python: "3.5"
- before_install: pip install cython
- script: python setup.py build_ext
- - language: python
- python: "3.9"
- before_install: pip install cython
- script: python setup.py build_ext
=====================================
Makefile
=====================================
@@ -8,6 +8,10 @@ PROG= minimap2
PROG_EXTRA= sdust minimap2-lite
LIBS= -lm -lz -lpthread
+ifneq ($(aarch64),)
+ arm_neon=1
+endif
+
ifeq ($(arm_neon),) # if arm_neon is not defined
ifeq ($(sse2only),) # if sse2only is not defined
OBJS+=ksw2_extz2_sse41.o ksw2_extd2_sse41.o ksw2_exts2_sse41.o ksw2_extz2_sse2.o ksw2_extd2_sse2.o ksw2_exts2_sse2.o ksw2_dispatch.o
@@ -26,12 +30,12 @@ endif
ifneq ($(asan),)
CFLAGS+=-fsanitize=address
- LIBS+=-fsanitize=address
+ LIBS+=-fsanitize=address -ldl
endif
ifneq ($(tsan),)
CFLAGS+=-fsanitize=thread
- LIBS+=-fsanitize=thread
+ LIBS+=-fsanitize=thread -ldl
endif
.PHONY:all extra clean depend
=====================================
NEWS.md
=====================================
@@ -1,3 +1,50 @@
+Release 2.26-r1175 (29 April 2023)
+----------------------------------
+
+Fixed the broken Python package. This is the only change.
+
+(2.25: 25 April 2023, r1173)
+
+
+
+Release 2.25-r1173 (25 April 2023)
+----------------------------------
+
+Notable changes:
+
+ * Improvement: use the miniprot splice model for RNA-seq alignment by default.
+ This model considers non-GT-AG splice sites and leads to slightly higher
+ (<0.1%) accuracy and sensitivity on real human data.
+
+ * Change: increased the default `-I` to `8G` such that minimap2 would create a
+ uni-part index for a pair of mammalian genomes. This change may increase the
+ memory for all-vs-all read overlap alignment given large datasets.
+
+ * New feature: output the sequences in secondary alignments with option
+ `--secondary-seq` (#687).
+
+ * Bugfix: --rmq was not parsed correctly (#1010)
+
+ * Bugfix: possibly incorrect coordinate when applying end bonus to the target
+ sequence (#1025). This is a ksw2 bug. It does not affect minimap2 as
+ minimap2 is not using the affected feature.
+
+ * Improvement: incorporated several changes for better compatibility with
+ Windows (#1051) and for minimap2 integration at Oxford Nanopore Technologies
+ (#1048 and #1033).
+
+ * Improvement: output the HD-line in SAM output (#1019).
+
+ * Improvement: check minimap2 index file in mappy to prevent segmentation
+ fault for certain indices (#1008).
+
+For genomic sequences, minimap2 should give identical output to v2.24.
+Long-read RNA-seq alignment may occasionally differ from previous versions.
+
+(2.25: 25 April 2023, r1173)
+
+
+
Release 2.24-r1122 (26 December 2021)
-------------------------------------
=====================================
README.md
=====================================
@@ -74,8 +74,8 @@ Detailed evaluations are available from the [minimap2 paper][doi] or the
Minimap2 is optimized for x86-64 CPUs. You can acquire precompiled binaries from
the [release page][release] with:
```sh
-curl -L https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2 | tar -jxvf -
-./minimap2-2.24_x64-linux/minimap2
+curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
+./minimap2-2.26_x64-linux/minimap2
```
If you want to compile from the source, you need to have a C compiler, GNU make
and zlib development files installed. Then type `make` in the source code
@@ -350,6 +350,11 @@ If you use minimap2 in your work, please cite:
> Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences.
> *Bioinformatics*, **34**:3094-3100. [doi:10.1093/bioinformatics/bty191][doi]
+and/or:
+
+> Li, H. (2021). New strategies to improve minimap2 alignment accuracy.
+> *Bioinformatics*, **37**:4572-4574. [doi:10.1093/bioinformatics/btab705][doi2]
+
## <a name="dguide"></a>Developers' Guide
Minimap2 is not only a command line tool, but also a programming library.
@@ -399,5 +404,6 @@ mappy` or [from BioConda][mappyconda] via `conda install -c bioconda mappy`.
[manpage]: https://lh3.github.io/minimap2/minimap2.html
[manpage-cs]: https://lh3.github.io/minimap2/minimap2.html#10
[doi]: https://doi.org/10.1093/bioinformatics/bty191
-[smide]: https://github.com/nemequ/simde
+[doi2]: https://doi.org/10.1093/bioinformatics/btab705
+[simde]: https://github.com/nemequ/simde
[unimap]: https://github.com/lh3/unimap
=====================================
align.c
=====================================
@@ -326,9 +326,11 @@ static void mm_align_pair(void *km, const mm_mapopt_t *opt, int qlen, const uint
if (opt->max_sw_mat > 0 && (int64_t)tlen * qlen > opt->max_sw_mat) {
ksw_reset_extz(ez);
ez->zdropped = 1;
- } else if (opt->flag & MM_F_SPLICE)
- ksw_exts2_sse(km, qlen, qseq, tlen, tseq, 5, mat, opt->q, opt->e, opt->q2, opt->noncan, zdrop, opt->junc_bonus, flag, junc, ez);
- else if (opt->q == opt->q2 && opt->e == opt->e2)
+ } else if (opt->flag & MM_F_SPLICE) {
+ int flag_tmp = flag;
+ if (!(opt->flag & MM_F_SPLICE_OLD)) flag_tmp |= KSW_EZ_SPLICE_CMPLX;
+ ksw_exts2_sse(km, qlen, qseq, tlen, tseq, 5, mat, opt->q, opt->e, opt->q2, opt->noncan, zdrop, opt->junc_bonus, flag_tmp, junc, ez);
+ } else if (opt->q == opt->q2 && opt->e == opt->e2)
ksw_extz2_sse(km, qlen, qseq, tlen, tseq, 5, mat, opt->q, opt->e, w, zdrop, end_bonus, flag, ez);
else
ksw_extd2_sse(km, qlen, qseq, tlen, tseq, 5, mat, opt->q, opt->e, opt->q2, opt->e2, w, zdrop, end_bonus, flag, ez);
=====================================
cookbook.md
=====================================
@@ -31,8 +31,8 @@ To acquire the data used in this cookbook and to install minimap2 and paftools,
please follow the command lines below:
```sh
# install minimap2 executables
-curl -L https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2 | tar jxf -
-cp minimap2-2.24_x64-linux/{minimap2,k8,paftools.js} . # copy executables
+curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar jxf -
+cp minimap2-2.26_x64-linux/{minimap2,k8,paftools.js} . # copy executables
export PATH="$PATH:"`pwd` # put the current directory on PATH
# download example datasets
curl -L https://github.com/lh3/minimap2/releases/download/v2.10/cookbook-data.tgz | tar zxf -
=====================================
format.c
=====================================
@@ -119,6 +119,7 @@ int mm_write_sam_hdr(const mm_idx_t *idx, const char *rg, const char *ver, int a
{
kstring_t str = {0,0,0};
int ret = 0;
+ mm_sprintf_lite(&str, "@HD\tVN:1.6\tSO:unsorted\tGO:query\n");
if (idx) {
uint32_t i;
for (i = 0; i < idx->n_seq; ++i)
@@ -369,14 +370,16 @@ static void write_sam_cigar(kstring_t *s, int sam_flag, int in_tag, int qlen, co
clip_len[0] = r->rev? qlen - r->qe : r->qs;
clip_len[1] = r->rev? r->qs : qlen - r->qe;
if (in_tag) {
- int clip_char = (sam_flag&0x800) && !(opt_flag&MM_F_SOFTCLIP)? 5 : 4;
+ int clip_char = (((sam_flag&0x800) || ((sam_flag&0x100) && (opt_flag&MM_F_SECONDARY_SEQ))) &&
+ !(opt_flag&MM_F_SOFTCLIP)) ? 5 : 4;
mm_sprintf_lite(s, "\tCG:B:I");
if (clip_len[0]) mm_sprintf_lite(s, ",%u", clip_len[0]<<4|clip_char);
for (k = 0; k < r->p->n_cigar; ++k)
mm_sprintf_lite(s, ",%u", r->p->cigar[k]);
if (clip_len[1]) mm_sprintf_lite(s, ",%u", clip_len[1]<<4|clip_char);
} else {
- int clip_char = (sam_flag&0x800) && !(opt_flag&MM_F_SOFTCLIP)? 'H' : 'S';
+ int clip_char = (((sam_flag&0x800) || ((sam_flag&0x100) && (opt_flag&MM_F_SECONDARY_SEQ))) &&
+ !(opt_flag&MM_F_SOFTCLIP)) ? 'H' : 'S';
assert(clip_len[0] < qlen && clip_len[1] < qlen);
if (clip_len[0]) mm_sprintf_lite(s, "%d%c", clip_len[0], clip_char);
for (k = 0; k < r->p->n_cigar; ++k)
@@ -451,7 +454,7 @@ void mm_write_sam3(kstring_t *s, const mm_idx_t *mi, const mm_bseq1_t *t, int se
if (cigar_in_tag) {
int slen;
if ((flag & 0x900) == 0 || (opt_flag & MM_F_SOFTCLIP)) slen = t->l_seq;
- else if (flag & 0x100) slen = 0;
+ else if ((flag & 0x100) && !(opt_flag & MM_F_SECONDARY_SEQ)) slen = 0;
else slen = r->qe - r->qs;
mm_sprintf_lite(s, "%dS%dN", slen, r->re - r->rs);
} else write_sam_cigar(s, flag, 0, t->l_seq, r, opt_flag);
@@ -492,7 +495,7 @@ void mm_write_sam3(kstring_t *s, const mm_idx_t *mi, const mm_bseq1_t *t, int se
mm_sprintf_lite(s, "\t");
if (t->qual) sam_write_sq(s, t->qual, t->l_seq, r->rev, 0);
else mm_sprintf_lite(s, "*");
- } else if (flag & 0x100) {
+ } else if ((flag & 0x100) && !(opt_flag & MM_F_SECONDARY_SEQ)){
mm_sprintf_lite(s, "*\t*");
} else {
sam_write_sq(s, t->seq + r->qs, r->qe - r->qs, r->rev, r->rev);
=====================================
kalloc.c
=====================================
@@ -40,7 +40,8 @@ void *km_init2(void *km_par, size_t min_core_size)
kmem_t *km;
km = (kmem_t*)kcalloc(km_par, 1, sizeof(kmem_t));
km->par = km_par;
- km->min_core_size = min_core_size > 0? min_core_size : 0x80000;
+ if (km_par) km->min_core_size = min_core_size > 0? min_core_size : ((kmem_t*)km_par)->min_core_size - 2;
+ else km->min_core_size = min_core_size > 0? min_core_size : 0x80000;
return (void*)km;
}
@@ -183,6 +184,16 @@ void *krealloc(void *_km, void *ap, size_t n_bytes) // TODO: this can be made mo
return q;
}
+void *krelocate(void *km, void *ap, size_t n_bytes)
+{
+ void *p;
+ if (km == 0 || ap == 0) return ap;
+ p = kmalloc(km, n_bytes);
+ memcpy(p, ap, n_bytes);
+ kfree(km, ap);
+ return p;
+}
+
void km_stat(const void *_km, km_stat_t *s)
{
kmem_t *km = (kmem_t*)_km;
@@ -203,3 +214,11 @@ void km_stat(const void *_km, km_stat_t *s)
s->largest = s->largest > size? s->largest : size;
}
}
+
+void km_stat_print(const void *km)
+{
+ km_stat_t st;
+ km_stat(km, &st);
+ fprintf(stderr, "[km_stat] cap=%ld, avail=%ld, largest=%ld, n_core=%ld, n_block=%ld\n",
+ st.capacity, st.available, st.largest, st.n_blocks, st.n_cores);
+}
=====================================
kalloc.h
=====================================
@@ -13,6 +13,7 @@ typedef struct {
void *kmalloc(void *km, size_t size);
void *krealloc(void *km, void *ptr, size_t size);
+void *krelocate(void *km, void *ap, size_t n_bytes);
void *kcalloc(void *km, size_t count, size_t size);
void kfree(void *km, void *ptr);
@@ -20,11 +21,21 @@ void *km_init(void);
void *km_init2(void *km_par, size_t min_core_size);
void km_destroy(void *km);
void km_stat(const void *_km, km_stat_t *s);
+void km_stat_print(const void *km);
#ifdef __cplusplus
}
#endif
+#define Kmalloc(km, type, cnt) ((type*)kmalloc((km), (cnt) * sizeof(type)))
+#define Kcalloc(km, type, cnt) ((type*)kcalloc((km), (cnt), sizeof(type)))
+#define Krealloc(km, type, ptr, cnt) ((type*)krealloc((km), (ptr), (cnt) * sizeof(type)))
+
+#define Kexpand(km, type, a, m) do { \
+ (m) = (m) >= 4? (m) + ((m)>>1) : 16; \
+ (a) = Krealloc(km, type, (a), (m)); \
+ } while (0)
+
#define KMALLOC(km, ptr, len) ((ptr) = (__typeof__(ptr))kmalloc((km), (len) * sizeof(*(ptr))))
#define KCALLOC(km, ptr, len) ((ptr) = (__typeof__(ptr))kcalloc((km), (len), sizeof(*(ptr))))
#define KREALLOC(km, ptr, len) ((ptr) = (__typeof__(ptr))krealloc((km), (ptr), (len) * sizeof(*(ptr))))
@@ -50,7 +61,7 @@ void km_stat(const void *_km, km_stat_t *s);
} kmp_##name##_t; \
SCOPE kmp_##name##_t *kmp_init_##name(void *km) { \
kmp_##name##_t *mp; \
- KCALLOC(km, mp, 1); \
+ mp = Kcalloc(km, kmp_##name##_t, 1); \
mp->km = km; \
return mp; \
} \
@@ -66,7 +77,7 @@ void km_stat(const void *_km, km_stat_t *s);
} \
SCOPE void kmp_free_##name(kmp_##name##_t *mp, kmptype_t *p) { \
--mp->cnt; \
- if (mp->n == mp->max) KEXPAND(mp->km, mp->buf, mp->max); \
+ if (mp->n == mp->max) Kexpand(mp->km, kmptype_t*, mp->buf, mp->max); \
mp->buf[mp->n++] = p; \
}
=====================================
ksw2.h
=====================================
@@ -15,6 +15,7 @@
#define KSW_EZ_SPLICE_FOR 0x100
#define KSW_EZ_SPLICE_REV 0x200
#define KSW_EZ_SPLICE_FLANK 0x400
+#define KSW_EZ_SPLICE_CMPLX 0x800
// The subset of CIGAR operators used by ksw code.
// Use MM_CIGAR_* from minimap.h if you need the full list.
=====================================
ksw2_extd2_sse.c
=====================================
@@ -358,7 +358,7 @@ void ksw_extd2_sse(void *km, int qlen, const uint8_t *query, int tlen, const uin
} else H[0] = v8[0] - qe, max_H = H[0], max_t = 0; // special casing r==0
// update ez
if (en0 == tlen - 1 && H[en0] > ez->mte)
- ez->mte = H[en0], ez->mte_q = r - en;
+ ez->mte = H[en0], ez->mte_q = r - en0;
if (r - st0 == qlen - 1 && H[st0] > ez->mqe)
ez->mqe = H[st0], ez->mqe_t = st0;
if (ksw_apply_zdrop(ez, 1, max_H, r, max_t, zdrop, e2)) break;
=====================================
ksw2_exts2_sse.c
=====================================
@@ -71,6 +71,7 @@ void ksw_exts2_sse(void *km, int qlen, const uint8_t *query, int tlen, const uin
ksw_reset_extz(ez);
if (m <= 1 || qlen <= 0 || tlen <= 0 || q2 <= q + e) return;
+ assert((flag & KSW_EZ_SPLICE_FOR) == 0 || (flag & KSW_EZ_SPLICE_REV) == 0); // can't be both set
zero_ = _mm_set1_epi8(0);
q_ = _mm_set1_epi8(q);
@@ -118,55 +119,93 @@ void ksw_exts2_sse(void *km, int qlen, const uint8_t *query, int tlen, const uin
// set the donor and acceptor arrays. TODO: this assumes 0/1/2/3 encoding!
if (flag & (KSW_EZ_SPLICE_FOR|KSW_EZ_SPLICE_REV)) {
- int semi_cost = flag&KSW_EZ_SPLICE_FLANK? -noncan/2 : 0; // GTr or yAG is worth 0.5 bit; see PMID:18688272
- memset(donor, -noncan, tlen_ * 16);
- memset(acceptor, -noncan, tlen_ * 16);
+ const int sp0[4] = { 8, 15, 21, 30 };
+ int sp[4];
+ if (flag & KSW_EZ_SPLICE_CMPLX) {
+ for (t = 0; t < 4; ++t)
+ sp[t] = (int)((double)sp0[t] / 3. + .499);
+ } else {
+ sp[0] = flag&KSW_EZ_SPLICE_FLANK? noncan / 2 : 0;
+ sp[1] = sp[2] = sp[3] = noncan;
+ }
+ memset(donor, -sp[3], tlen_ * 16);
+ memset(acceptor, -sp[3], tlen_ * 16);
if (!(flag & KSW_EZ_REV_CIGAR)) {
for (t = 0; t < tlen - 4; ++t) {
- int can_type = 0; // type of canonical site: 0=none, 1=GT/AG only, 2=GTr/yAG
- if ((flag & KSW_EZ_SPLICE_FOR) && target[t+1] == 2 && target[t+2] == 3) can_type = 1; // GTr...
- if ((flag & KSW_EZ_SPLICE_REV) && target[t+1] == 1 && target[t+2] == 3) can_type = 1; // CTr...
- if (can_type && (target[t+3] == 0 || target[t+3] == 2)) can_type = 2;
- if (can_type) ((int8_t*)donor)[t] = can_type == 2? 0 : semi_cost;
+ int z = 3;
+ if (flag & KSW_EZ_SPLICE_FOR) {
+ if (target[t+1] == 2 && target[t+2] == 3) // |GT.
+ z = target[t+3] == 0 || target[t+3] == 2? -1 : 0; // |GTr or not
+ else if (target[t+1] == 2 && target[t+2] == 1) z = 1; // |GC.
+ else if (target[t+1] == 0 && target[t+2] == 3) z = 2; // |AT.
+ } else if (flag & KSW_EZ_SPLICE_REV) {
+ if (target[t+1] == 1 && target[t+2] == 3) // |CT. (revcomp of .AG|)
+ z = target[t+3] == 0 || target[t+3] == 2? -1 : 0;
+ else if (target[t+1] == 2 && target[t+2] == 3) z = 2; // |GT. (revcomp of .AC|)
+ }
+ ((int8_t*)donor)[t] = z < 0? 0 : -sp[z];
}
- if (junc)
- for (t = 0; t < tlen - 1; ++t)
- if (((flag & KSW_EZ_SPLICE_FOR) && (junc[t+1]&1)) || ((flag & KSW_EZ_SPLICE_REV) && (junc[t+1]&8)))
- ((int8_t*)donor)[t] += junc_bonus;
for (t = 2; t < tlen; ++t) {
- int can_type = 0;
- if ((flag & KSW_EZ_SPLICE_FOR) && target[t-1] == 0 && target[t] == 2) can_type = 1; // ...yAG
- if ((flag & KSW_EZ_SPLICE_REV) && target[t-1] == 0 && target[t] == 1) can_type = 1; // ...yAC
- if (can_type && (target[t-2] == 1 || target[t-2] == 3)) can_type = 2;
- if (can_type) ((int8_t*)acceptor)[t] = can_type == 2? 0 : semi_cost;
+ int z = 3;
+ if (flag & KSW_EZ_SPLICE_FOR) {
+ if (target[t-1] == 0 && target[t] == 2) // .AG|
+ z = target[t-2] == 1 || target[t-2] == 3? -1 : 0; // yAG| or not
+ else if (target[t-1] == 0 && target[t] == 1) z = 2; // .AC|
+ } else if (flag & KSW_EZ_SPLICE_REV) {
+ if (target[t-1] == 0 && target[t] == 1) // .AC| (revcomp of |GT.)
+ z = target[t-2] == 1 || target[t-2] == 3? -1 : 0; // yAC| or not
+ else if (target[t-1] == 2 && target[t] == 1) z = 1; // .GC| (revcomp of |GC.)
+ else if (target[t-1] == 0 && target[t] == 3) z = 2; // .AT| (revcomp of |AT.)
+ }
+ ((int8_t*)acceptor)[t] = z < 0? 0 : -sp[z];
}
- if (junc)
- for (t = 0; t < tlen; ++t)
- if (((flag & KSW_EZ_SPLICE_FOR) && (junc[t]&2)) || ((flag & KSW_EZ_SPLICE_REV) && (junc[t]&4)))
- ((int8_t*)acceptor)[t] += junc_bonus;
} else {
for (t = 0; t < tlen - 4; ++t) {
- int can_type = 0; // type of canonical site: 0=none, 1=GT/AG only, 2=GTr/yAG
- if ((flag & KSW_EZ_SPLICE_FOR) && target[t+1] == 2 && target[t+2] == 0) can_type = 1; // GAy...
- if ((flag & KSW_EZ_SPLICE_REV) && target[t+1] == 1 && target[t+2] == 0) can_type = 1; // CAy...
- if (can_type && (target[t+3] == 1 || target[t+3] == 3)) can_type = 2;
- if (can_type) ((int8_t*)donor)[t] = can_type == 2? 0 : semi_cost;
+ int z = 3;
+ if (flag & KSW_EZ_SPLICE_FOR) {
+ if (target[t+1] == 2 && target[t+2] == 0) // |GA. (rev of .AG|)
+ z = target[t+3] == 1 || target[t+3] == 3? -1 : 0;
+ else if (target[t+1] == 1 && target[t+2] == 0) z = 2; // |CA. (rev of .AC|)
+ } else if (flag & KSW_EZ_SPLICE_REV) {
+ if (target[t+1] == 1 && target[t+2] == 0) // |CA. (comp of |GT.)
+ z = target[t+3] == 1 || target[t+3] == 3? -1 : 0;
+ else if (target[t+1] == 1 && target[t+2] == 2) z = 1; // |CG. (comp of |GC.)
+ else if (target[t+1] == 3 && target[t+2] == 0) z = 2; // |TA. (comp of |AT.)
+ }
+ ((int8_t*)donor)[t] = z < 0? 0 : -sp[z];
}
- if (junc)
- for (t = 0; t < tlen - 1; ++t)
- if (((flag & KSW_EZ_SPLICE_FOR) && (junc[t+1]&2)) || ((flag & KSW_EZ_SPLICE_REV) && (junc[t+1]&4)))
- ((int8_t*)donor)[t] += junc_bonus;
for (t = 2; t < tlen; ++t) {
- int can_type = 0;
- if ((flag & KSW_EZ_SPLICE_FOR) && target[t-1] == 3 && target[t] == 2) can_type = 1; // ...rTG
- if ((flag & KSW_EZ_SPLICE_REV) && target[t-1] == 3 && target[t] == 1) can_type = 1; // ...rTC
- if (can_type && (target[t-2] == 0 || target[t-2] == 2)) can_type = 2;
- if (can_type) ((int8_t*)acceptor)[t] = can_type == 2? 0 : semi_cost;
+ int z = 3;
+ if (flag & KSW_EZ_SPLICE_FOR) {
+ if (target[t-1] == 3 && target[t] == 2) // .TG| (rev of |GT.)
+ z = target[t-2] == 0 || target[t-2] == 2? -1 : 0;
+ else if (target[t-1] == 1 && target[t] == 2) z = 1; // .CG| (rev of |GC.)
+ else if (target[t-1] == 3 && target[t] == 0) z = 2; // .TA| (rev of |AT.)
+ } else if (flag & KSW_EZ_SPLICE_REV) {
+ if (target[t-1] == 3 && target[t] == 1) // .TC| (comp of .AG|)
+ z = target[t-2] == 0 || target[t-2] == 2? -1 : 0;
+ else if (target[t-1] == 3 && target[t] == 2) z = 2; // .TG| (comp of .AC|)
+ }
+ ((int8_t*)acceptor)[t] = z < 0? 0 : -sp[z];
}
- if (junc)
- for (t = 0; t < tlen; ++t)
- if (((flag & KSW_EZ_SPLICE_FOR) && (junc[t]&1)) || ((flag & KSW_EZ_SPLICE_REV) && (junc[t]&8)))
- ((int8_t*)acceptor)[t] += junc_bonus;
+ }
+ }
+
+ if (junc) {
+ if (!(flag & KSW_EZ_REV_CIGAR)) {
+ for (t = 0; t < tlen - 1; ++t)
+ if (((flag & KSW_EZ_SPLICE_FOR) && (junc[t+1]&1)) || ((flag & KSW_EZ_SPLICE_REV) && (junc[t+1]&8)))
+ ((int8_t*)donor)[t] += junc_bonus;
+ for (t = 0; t < tlen; ++t)
+ if (((flag & KSW_EZ_SPLICE_FOR) && (junc[t]&2)) || ((flag & KSW_EZ_SPLICE_REV) && (junc[t]&4)))
+ ((int8_t*)acceptor)[t] += junc_bonus;
+ } else {
+ for (t = 0; t < tlen - 1; ++t)
+ if (((flag & KSW_EZ_SPLICE_FOR) && (junc[t+1]&2)) || ((flag & KSW_EZ_SPLICE_REV) && (junc[t+1]&4)))
+ ((int8_t*)donor)[t] += junc_bonus;
+ for (t = 0; t < tlen; ++t)
+ if (((flag & KSW_EZ_SPLICE_FOR) && (junc[t]&1)) || ((flag & KSW_EZ_SPLICE_REV) && (junc[t]&8)))
+ ((int8_t*)acceptor)[t] += junc_bonus;
}
}
@@ -376,7 +415,7 @@ void ksw_exts2_sse(void *km, int qlen, const uint8_t *query, int tlen, const uin
} else H[0] = v8[0] - qe, max_H = H[0], max_t = 0; // special casing r==0
// update ez
if (en0 == tlen - 1 && H[en0] > ez->mte)
- ez->mte = H[en0], ez->mte_q = r - en;
+ ez->mte = H[en0], ez->mte_q = r - en0;
if (r - st0 == qlen - 1 && H[st0] > ez->mqe)
ez->mqe = H[st0], ez->mqe_t = st0;
if (ksw_apply_zdrop(ez, 1, max_H, r, max_t, zdrop, 0)) break;
=====================================
ksw2_extz2_sse.c
=====================================
@@ -269,7 +269,7 @@ void ksw_extz2_sse(void *km, int qlen, const uint8_t *query, int tlen, const uin
} else H[0] = v8[0] - qe - qe, max_H = H[0], max_t = 0; // special casing r==0
// update ez
if (en0 == tlen - 1 && H[en0] > ez->mte)
- ez->mte = H[en0], ez->mte_q = r - en;
+ ez->mte = H[en0], ez->mte_q = r - en0;
if (r - st0 == qlen - 1 && H[st0] > ez->mqe)
ez->mqe = H[st0], ez->mqe_t = st0;
if (ksw_apply_zdrop(ez, 1, max_H, r, max_t, zdrop, e)) break;
=====================================
lchain.c
=====================================
@@ -35,7 +35,7 @@ uint64_t *mg_chain_backtrack(void *km, int64_t n, const int32_t *f, const int64_
for (i = 0, n_z = 0; i < n; ++i) // precompute n_z
if (f[i] >= min_sc) ++n_z;
if (n_z == 0) return 0;
- KMALLOC(km, z, n_z);
+ z = Kmalloc(km, mm128_t, n_z);
for (i = 0, k = 0; i < n; ++i) // populate z[]
if (f[i] >= min_sc) z[k].x = f[i], z[k++].y = i;
radix_sort_128x(z, z + n_z);
@@ -54,7 +54,7 @@ uint64_t *mg_chain_backtrack(void *km, int64_t n, const int32_t *f, const int64_
else n_v = n_v0;
}
}
- KMALLOC(km, u, n_u);
+ u = Kmalloc(km, uint64_t, n_u);
memset(t, 0, n * 4);
for (k = n_z - 1, n_v = n_u = 0; k >= 0; --k) { // populate u[]
if (t[z[k].y] == 0) {
@@ -82,7 +82,7 @@ static mm128_t *compact_a(void *km, int32_t n_u, uint64_t *u, int32_t n_v, int32
int64_t i, j, k;
// write the result to b[]
- KMALLOC(km, b, n_v);
+ b = Kmalloc(km, mm128_t, n_v);
for (i = 0, k = 0; i < n_u; ++i) {
int32_t k0 = k, ni = (int32_t)u[i];
for (j = 0; j < ni; ++j)
@@ -91,13 +91,13 @@ static mm128_t *compact_a(void *km, int32_t n_u, uint64_t *u, int32_t n_v, int32
kfree(km, v);
// sort u[] and a[] by the target position, such that adjacent chains may be joined
- KMALLOC(km, w, n_u);
+ w = Kmalloc(km, mm128_t, n_u);
for (i = k = 0; i < n_u; ++i) {
w[i].x = b[k].x, w[i].y = (uint64_t)k<<32|i;
k += (int32_t)u[i];
}
radix_sort_128x(w, w + n_u);
- KMALLOC(km, u2, n_u);
+ u2 = Kmalloc(km, uint64_t, n_u);
for (i = k = 0; i < n_u; ++i) {
int32_t j = (int32_t)w[i].y, n = (int32_t)u[j];
u2[i] = u[j];
@@ -138,7 +138,7 @@ static inline int32_t comput_sc(const mm128_t *ai, const mm128_t *aj, int32_t ma
}
/* Input:
- * a[].x: tid<<33 | rev<<32 | tpos
+ * a[].x: rev<<63 | tid<<32 | tpos
* a[].y: flags<<40 | q_span<<32 | q_pos
* Output:
* n_u: #chains
@@ -160,10 +160,10 @@ mm128_t *mg_lchain_dp(int max_dist_x, int max_dist_y, int bw, int max_skip, int
if (max_dist_x < bw) max_dist_x = bw;
if (max_dist_y < bw && !is_cdna) max_dist_y = bw;
if (is_cdna) max_drop = INT32_MAX;
- KMALLOC(km, p, n);
- KMALLOC(km, f, n);
- KMALLOC(km, v, n);
- KCALLOC(km, t, n);
+ p = Kmalloc(km, int64_t, n);
+ f = Kmalloc(km, int32_t, n);
+ v = Kmalloc(km, int32_t, n);
+ t = Kcalloc(km, int32_t, n);
// fill the score and backtrack arrays
for (i = 0, max_ii = -1; i < n; ++i) {
@@ -251,7 +251,7 @@ mm128_t *mg_lchain_rmq(int max_dist, int max_dist_inner, int bw, int max_chn_ski
int64_t n, mm128_t *a, int *n_u_, uint64_t **_u, void *km)
{
int32_t *f,*t, *v, n_u, n_v, mmax_f = 0, max_rmq_size = 0, max_drop = bw;
- int64_t *p, i, i0, st = 0, st_inner = 0, n_iter = 0;
+ int64_t *p, i, i0, st = 0, st_inner = 0;
uint64_t *u;
lc_elem_t *root = 0, *root_inner = 0;
void *mem_mp = 0;
@@ -264,10 +264,10 @@ mm128_t *mg_lchain_rmq(int max_dist, int max_dist_inner, int bw, int max_chn_ski
}
if (max_dist < bw) max_dist = bw;
if (max_dist_inner <= 0 || max_dist_inner >= max_dist) max_dist_inner = 0;
- KMALLOC(km, p, n);
- KMALLOC(km, f, n);
- KCALLOC(km, t, n);
- KMALLOC(km, v, n);
+ p = Kmalloc(km, int64_t, n);
+ f = Kmalloc(km, int32_t, n);
+ t = Kcalloc(km, int32_t, n);
+ v = Kmalloc(km, int32_t, n);
mem_mp = km_init2(km, 0x10000);
mp = kmp_init_rmq(mem_mp);
@@ -345,7 +345,6 @@ mm128_t *mg_lchain_rmq(int max_dist, int max_dist_inner, int bw, int max_chn_ski
}
if (!krmq_itr_prev(lc_elem, &itr)) break;
}
- n_iter += n_rmq_iter;
}
}
}
=====================================
main.c
=====================================
@@ -7,8 +7,6 @@
#include "mmpriv.h"
#include "ketopt.h"
-#define MM_VERSION "2.24-r1122"
-
#ifdef __linux__
#include <sys/resource.h>
#include <sys/time.h>
@@ -78,6 +76,7 @@ static ko_longopt_t long_options[] = {
{ "chain-skip-scale",ko_required_argument,351 },
{ "print-chains", ko_no_argument, 352 },
{ "no-hash-name", ko_no_argument, 353 },
+ { "secondary-seq", ko_no_argument, 354 },
{ "help", ko_no_argument, 'h' },
{ "max-intron-len", ko_required_argument, 'G' },
{ "version", ko_no_argument, 'V' },
@@ -121,7 +120,7 @@ static inline void yes_or_no(mm_mapopt_t *opt, int64_t flag, int long_idx, const
int main(int argc, char *argv[])
{
- const char *opt_str = "2aSDw:k:K:t:r:f:Vv:g:G:I:d:XT:s:x:Hcp:M:n:z:A:B:O:E:m:N:Qu:R:hF:LC:yYPo:e:U:";
+ const char *opt_str = "2aSDw:k:K:t:r:f:Vv:g:G:I:d:XT:s:x:Hcp:M:n:z:A:B:O:E:m:N:Qu:R:hF:LC:yYPo:e:U:J:";
ketopt_t o = KETOPT_INIT;
mm_mapopt_t opt;
mm_idxopt_t ipt;
@@ -187,7 +186,12 @@ int main(int argc, char *argv[])
else if (c == 'R') rg = o.arg;
else if (c == 'h') fp_help = stdout;
else if (c == '2') opt.flag |= MM_F_2_IO_THREADS;
- else if (c == 'o') {
+ else if (c == 'J') {
+ int t;
+ t = atoi(o.arg);
+ if (t == 0) opt.flag |= MM_F_SPLICE_OLD;
+ else if (t == 1) opt.flag &= ~MM_F_SPLICE_OLD;
+ } else if (c == 'o') {
if (strcmp(o.arg, "-") != 0) {
if (freopen(o.arg, "wb", stdout) == NULL) {
fprintf(stderr, "[ERROR]\033[1;31m failed to write the output to file '%s'\033[0m: %s\n", o.arg, strerror(errno));
@@ -237,6 +241,7 @@ int main(int argc, char *argv[])
else if (c == 350) opt.q_occ_frac = atof(o.arg); // --q-occ-frac
else if (c == 352) mm_dbg_flag |= MM_DBG_PRINT_CHAIN; // --print-chains
else if (c == 353) opt.flag |= MM_F_NO_HASH_NAME; // --no-hash-name
+ else if (c == 354) opt.flag |= MM_F_SECONDARY_SEQ; // --secondary-seq
else if (c == 330) {
fprintf(stderr, "[WARNING] \033[1;31m --lj-min-ratio has been deprecated.\033[0m\n");
} else if (c == 314) { // --frag
@@ -261,7 +266,8 @@ int main(int argc, char *argv[])
} else if (c == 326) { // --dual
yes_or_no(&opt, MM_F_NO_DUAL, o.longidx, o.arg, 0);
} else if (c == 347) { // --rmq
- yes_or_no(&opt, MM_F_RMQ, o.longidx, o.arg, 1);
+ if (o.arg) yes_or_no(&opt, MM_F_RMQ, o.longidx, o.arg, 1);
+ else opt.flag |= MM_F_RMQ;
} else if (c == 'S') {
opt.flag |= MM_F_OUT_CS | MM_F_CIGAR | MM_F_OUT_CS_LONG;
if (mm_verbose >= 2)
@@ -322,7 +328,7 @@ int main(int argc, char *argv[])
fprintf(fp_help, " -H use homopolymer-compressed k-mer (preferrable for PacBio)\n");
fprintf(fp_help, " -k INT k-mer size (no larger than 28) [%d]\n", ipt.k);
fprintf(fp_help, " -w INT minimizer window size [%d]\n", ipt.w);
- fprintf(fp_help, " -I NUM split index for every ~NUM input bases [4G]\n");
+ fprintf(fp_help, " -I NUM split index for every ~NUM input bases [8G]\n");
fprintf(fp_help, " -d FILE dump index to FILE []\n");
fprintf(fp_help, " Mapping:\n");
fprintf(fp_help, " -f FLOAT filter out top FLOAT fraction of repetitive minimizers [%g]\n", opt.mid_occ_frac);
@@ -344,6 +350,7 @@ int main(int argc, char *argv[])
fprintf(fp_help, " -z INT[,INT] Z-drop score and inversion Z-drop score [%d,%d]\n", opt.zdrop, opt.zdrop_inv);
fprintf(fp_help, " -s INT minimal peak DP alignment score [%d]\n", opt.min_dp_max);
fprintf(fp_help, " -u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]\n");
+ fprintf(fp_help, " -J INT splice mode. 0: original minimap2 model; 1: miniprot model [1]\n");
fprintf(fp_help, " Input/Output:\n");
fprintf(fp_help, " -a output in the SAM format (PAF by default)\n");
fprintf(fp_help, " -o FILE output alignments to FILE [stdout]\n");
=====================================
map.c
=====================================
@@ -10,11 +10,6 @@
#include "bseq.h"
#include "khash.h"
-struct mm_tbuf_s {
- void *km;
- int rep_len, frag_gap;
-};
-
mm_tbuf_t *mm_tbuf_init(void)
{
mm_tbuf_t *b;
=====================================
minimap.h
=====================================
@@ -5,41 +5,45 @@
#include <stdio.h>
#include <sys/types.h>
-#define MM_F_NO_DIAG 0x001 // no exact diagonal hit
-#define MM_F_NO_DUAL 0x002 // skip pairs where query name is lexicographically larger than target name
-#define MM_F_CIGAR 0x004
-#define MM_F_OUT_SAM 0x008
-#define MM_F_NO_QUAL 0x010
-#define MM_F_OUT_CG 0x020
-#define MM_F_OUT_CS 0x040
-#define MM_F_SPLICE 0x080 // splice mode
-#define MM_F_SPLICE_FOR 0x100 // match GT-AG
-#define MM_F_SPLICE_REV 0x200 // match CT-AC, the reverse complement of GT-AG
-#define MM_F_NO_LJOIN 0x400
-#define MM_F_OUT_CS_LONG 0x800
-#define MM_F_SR 0x1000
-#define MM_F_FRAG_MODE 0x2000
-#define MM_F_NO_PRINT_2ND 0x4000
-#define MM_F_2_IO_THREADS 0x8000
-#define MM_F_LONG_CIGAR 0x10000
-#define MM_F_INDEPEND_SEG 0x20000
-#define MM_F_SPLICE_FLANK 0x40000
-#define MM_F_SOFTCLIP 0x80000
-#define MM_F_FOR_ONLY 0x100000
-#define MM_F_REV_ONLY 0x200000
-#define MM_F_HEAP_SORT 0x400000
-#define MM_F_ALL_CHAINS 0x800000
-#define MM_F_OUT_MD 0x1000000
-#define MM_F_COPY_COMMENT 0x2000000
-#define MM_F_EQX 0x4000000 // use =/X instead of M
-#define MM_F_PAF_NO_HIT 0x8000000 // output unmapped reads to PAF
-#define MM_F_NO_END_FLT 0x10000000
-#define MM_F_HARD_MLEVEL 0x20000000
-#define MM_F_SAM_HIT_ONLY 0x40000000
+#define MM_VERSION "2.26-r1175"
+
+#define MM_F_NO_DIAG (0x001LL) // no exact diagonal hit
+#define MM_F_NO_DUAL (0x002LL) // skip pairs where query name is lexicographically larger than target name
+#define MM_F_CIGAR (0x004LL)
+#define MM_F_OUT_SAM (0x008LL)
+#define MM_F_NO_QUAL (0x010LL)
+#define MM_F_OUT_CG (0x020LL)
+#define MM_F_OUT_CS (0x040LL)
+#define MM_F_SPLICE (0x080LL) // splice mode
+#define MM_F_SPLICE_FOR (0x100LL) // match GT-AG
+#define MM_F_SPLICE_REV (0x200LL) // match CT-AC, the reverse complement of GT-AG
+#define MM_F_NO_LJOIN (0x400LL)
+#define MM_F_OUT_CS_LONG (0x800LL)
+#define MM_F_SR (0x1000LL)
+#define MM_F_FRAG_MODE (0x2000LL)
+#define MM_F_NO_PRINT_2ND (0x4000LL)
+#define MM_F_2_IO_THREADS (0x8000LL)
+#define MM_F_LONG_CIGAR (0x10000LL)
+#define MM_F_INDEPEND_SEG (0x20000LL)
+#define MM_F_SPLICE_FLANK (0x40000LL)
+#define MM_F_SOFTCLIP (0x80000LL)
+#define MM_F_FOR_ONLY (0x100000LL)
+#define MM_F_REV_ONLY (0x200000LL)
+#define MM_F_HEAP_SORT (0x400000LL)
+#define MM_F_ALL_CHAINS (0x800000LL)
+#define MM_F_OUT_MD (0x1000000LL)
+#define MM_F_COPY_COMMENT (0x2000000LL)
+#define MM_F_EQX (0x4000000LL) // use =/X instead of M
+#define MM_F_PAF_NO_HIT (0x8000000LL) // output unmapped reads to PAF
+#define MM_F_NO_END_FLT (0x10000000LL)
+#define MM_F_HARD_MLEVEL (0x20000000LL)
+#define MM_F_SAM_HIT_ONLY (0x40000000LL)
#define MM_F_RMQ (0x80000000LL)
#define MM_F_QSTRAND (0x100000000LL)
#define MM_F_NO_INV (0x200000000LL)
#define MM_F_NO_HASH_NAME (0x400000000LL)
+#define MM_F_SPLICE_OLD (0x800000000LL)
+#define MM_F_SECONDARY_SEQ (0x1000000000LL) //output SEQ field for seqondary alignments using hard clipping
#define MM_I_HPC 0x1
#define MM_I_NO_SEQ 0x2
@@ -189,6 +193,11 @@ typedef struct {
} mm_idx_reader_t;
// memory buffer for thread-local storage during mapping
+struct mm_tbuf_s {
+ void *km;
+ int rep_len, frag_gap;
+};
+
typedef struct mm_tbuf_s mm_tbuf_t;
// global variables
=====================================
minimap2.1
=====================================
@@ -1,4 +1,4 @@
-.TH minimap2 1 "18 December 2021" "minimap2-2.24 (r1122)" "Bioinformatics tools"
+.TH minimap2 1 "29 April 2023" "minimap2-2.26 (r1175)" "Bioinformatics tools"
.SH NAME
.PP
minimap2 - mapping and alignment between collections of DNA sequences
@@ -79,6 +79,19 @@ Minimizer k-mer length [15]
.BI -w \ INT
Minimizer window size [10]. A minimizer is the smallest k-mer
in a window of w consecutive k-mers.
+.TP
+.BI -j \ INT
+Syncmer submer size [10]. Option
+.B -j
+and
+.B -w
+will override each: if
+.B -w
+is applied after
+.BR -j ,
+.B -j
+will have no effect, and vice versa.
+
.TP
.B -H
Use homopolymer-compressed (HPC) minimizers. An HPC sequence is constructed by
@@ -88,16 +101,17 @@ on the HPC sequence.
.BI -I \ NUM
Load at most
.I NUM
-target bases into RAM for indexing [4G]. If there are more than
+target bases into RAM for indexing [8G]. If there are more than
.I NUM
bases in
.IR target.fa ,
minimap2 needs to read
.I query.fa
-multiple times to map it against each batch of target sequences.
+multiple times to map it against each batch of target sequences. This would create a multi-part index.
.I NUM
may be ending with k/K/m/M/g/G. NB: mapping quality is incorrect given a
-multi-part index.
+multi-part index. See also option
+.BR --split-prefix .
.TP
.B --idx-no-seq
Don't store target sequences in the index. It saves disk space and memory but
@@ -587,7 +601,7 @@ Up to 20% sequence divergence.
.B splice
Long-read spliced alignment
.RB ( -k15
-.B -w5 --splice -g2k -G200k -A1 -B2 -O2,32 -E1,0 -b0 -C9 -z200 -ub --junc-bonus=9 --cap-sw-mem=0
+.B -w5 --splice -g2k -G200k -A1 -B2 -O2,32 -E1,0 -C9 -z200 -ub --junc-bonus=9 --cap-sw-mem=0
.BR --splice-flank=yes ).
In the splice mode, 1) long deletions are taken as introns and represented as
the
=====================================
misc/paftools.js
=====================================
@@ -1,6 +1,6 @@
#!/usr/bin/env k8
-var paftools_version = '2.24-r1122';
+var paftools_version = '2.26-r1175';
/*****************************
***** Library functions *****
@@ -1532,22 +1532,24 @@ function paf_view(args)
function paf_gff2bed(args)
{
- var c, fn_ucsc_fai = null, is_short = false, keep_gff = false, print_junc = false, output_gene = false;
- while ((c = getopt(args, "u:sgjG")) != null) {
+ var c, fn_ucsc_fai = null, is_short = false, keep_gff = false, print_junc = false, output_gene = false, ens_canon_only = false;
+ while ((c = getopt(args, "u:sgjGe")) != null) {
if (c == 'u') fn_ucsc_fai = getopt.arg;
else if (c == 's') is_short = true;
else if (c == 'g') keep_gff = true;
else if (c == 'j') print_junc = true;
else if (c == 'G') output_gene = true;
+ else if (c == 'e') ens_canon_only = true;
}
if (getopt.ind == args.length) {
print("Usage: paftools.js gff2bed [options] <in.gff>");
print("Options:");
- print(" -j Output junction BED");
- print(" -s Print names in the short form");
+ print(" -j output junction BED");
+ print(" -s print names in the short form");
print(" -u FILE hg38.fa.fai for chr name conversion");
- print(" -g Output GFF (used with -u)");
+ print(" -e only show transcript tagged with 'Ensembl_canonical'");
+ print(" -g output GFF (used with -u)");
exit(1);
}
@@ -1606,7 +1608,7 @@ function paf_gff2bed(args)
print(a[0][0], st, en, name, 1000, a[0][3], cds_st, cds_en, color, a.length, sizes.join(",") + ",", starts.join(",") + ",");
}
- var re_gtf = /\b(transcript_id|transcript_type|transcript_biotype|gene_name|gene_id|gbkey|transcript_name) "([^"]+)";/g;
+ var re_gtf = /\b(transcript_id|transcript_type|transcript_biotype|gene_name|gene_id|gbkey|transcript_name|tag) "([^"]+)";/g;
var re_gff3 = /\b(transcript_id|transcript_type|transcript_biotype|gene_name|gene_id|gbkey|transcript_name)=([^;]+)/g;
var re_gtf_gene = /\b(gene_id|gene_type|gene_name) "([^;]+)";/g;
var re_gff3_gene = /\b(gene_id|gene_type|source_gene|gene_biotype|gene_name)=([^;]+);/g;
@@ -1646,13 +1648,14 @@ function paf_gff2bed(args)
if (t[2] != "CDS" && t[2] != "exon") continue;
t[3] = parseInt(t[3]) - 1;
t[4] = parseInt(t[4]);
- var id = null, type = "", name = "N/A", biotype = "", m, tname = "N/A";
+ var id = null, type = "", name = "N/A", biotype = "", m, tname = "N/A", ens_canonical = false;
while ((m = re_gtf.exec(t[8])) != null) {
if (m[1] == "transcript_id") id = m[2];
else if (m[1] == "transcript_type") type = m[2];
else if (m[1] == "transcript_biotype" || m[1] == "gbkey") biotype = m[2];
else if (m[1] == "gene_name" || m[1] == "gene_id") name = m[2];
else if (m[1] == "transcript_name") tname = m[2];
+ else if (m[1] == "tag" && m[2] == "Ensembl_canonical") ens_canonical = true;
}
while ((m = re_gff3.exec(t[8])) != null) {
if (m[1] == "transcript_id") id = m[2];
@@ -1661,6 +1664,7 @@ function paf_gff2bed(args)
else if (m[1] == "gene_name" || m[1] == "gene_id") name = m[2];
else if (m[1] == "transcript_name") tname = m[2];
}
+ if (ens_canon_only && !ens_canonical) continue;
if (type == "" && biotype != "") type = biotype;
if (id == null) throw Error("No transcript_id");
if (id != last_id) {
@@ -2341,12 +2345,15 @@ function paf_pbsim2fq(args)
function paf_junceval(args)
{
- var c, l_fuzzy = 0, print_ovlp = false, print_err_only = false, first_only = false, chr_only = false;
- while ((c = getopt(args, "l:epc")) != null) {
+ var c, l_fuzzy = 0, print_ovlp = false, print_err_only = false, first_only = false, chr_only = false, aa = false, is_bed = false;
+ while ((c = getopt(args, "l:epcab1")) != null) {
if (c == 'l') l_fuzzy = parseInt(getopt.arg);
else if (c == 'e') print_err_only = print_ovlp = true;
else if (c == 'p') print_ovlp = true;
else if (c == 'c') chr_only = true;
+ else if (c == 'a') aa = true;
+ else if (c == 'b') is_bed = true;
+ else if (c == '1') first_only = true;
}
if (args.length - getopt.ind < 1) {
@@ -2356,6 +2363,9 @@ function paf_junceval(args)
print(" -p print overlapping introns");
print(" -e print erroreous overlapping introns");
print(" -c only consider alignments to /^(chr)?([0-9]+|X|Y)$/");
+ print(" -a miniprot PAF as input");
+ print(" -b BED as input");
+ print(" -1 only process the first alignment of each query");
exit(1);
}
@@ -2409,13 +2419,17 @@ function paf_junceval(args)
file = getopt.ind+1 >= args.length || args[getopt.ind+1] == '-'? new File() : new File(args[getopt.ind+1]);
var last_qname = null;
- var re_cigar = /(\d+)([MIDNSHP=X])/g;
+ var re_cigar = /(\d+)([MIDNSHP=XFGUV])/g;
while (file.readline(buf) >= 0) {
var m, t = buf.toString().split("\t");
- var ctg_name = null, cigar = null, pos = null, qname = t[0];
+ var ctg_name = null, cigar = null, pos = null, qname;
if (t[0].charAt(0) == '@') continue;
- if (t[4] == '+' || t[4] == '-' || t[4] == '*') { // PAF
+ if (t[0] == "##PAF") t.shift();
+ qname = t[0];
+ if (is_bed) {
+ ctg_name = t[0], pos = parseInt(t[1]), cigar == null;
+ } else if (t[4] == '+' || t[4] == '-' || t[4] == '*') { // PAF
ctg_name = t[5], pos = parseInt(t[7]);
var type = 'P';
for (i = 12; i < t.length; ++i) {
@@ -2445,12 +2459,43 @@ function paf_junceval(args)
}
var intron = [];
- while ((m = re_cigar.exec(cigar)) != null) {
- var len = parseInt(m[1]), op = m[2];
- if (op == 'N') {
- intron.push([pos, pos + len]);
- pos += len;
- } else if (op == 'M' || op == 'X' || op == '=' || op == 'D') pos += len;
+ if (is_bed) {
+ intron.push([pos, parseInt(t[2])]);
+ } else if (aa) {
+ var tmp_junc = [], tmp = 0;
+ while ((m = re_cigar.exec(cigar)) != null) {
+ var len = parseInt(m[1]), op = m[2];
+ if (op == 'N') {
+ tmp_junc.push([tmp, tmp + len]);
+ tmp += len;
+ } else if (op == 'U') {
+ tmp_junc.push([tmp + 1, tmp + len - 2]);
+ tmp += len;
+ } else if (op == 'V') {
+ tmp_junc.push([tmp + 2, tmp + len - 1]);
+ tmp += len;
+ } else if (op == 'M' || op == 'X' || op == '=' || op == 'D') {
+ tmp += len * 3;
+ } else if (op == 'F' || op == 'G') {
+ tmp += len;
+ }
+ }
+ if (t[4] == '+') {
+ for (var i = 0; i < tmp_junc.length; ++i)
+ intron.push([pos + tmp_junc[i][0], pos + tmp_junc[i][1]]);
+ } else if (t[4] == '-') {
+ var glen = parseInt(t[8]) - parseInt(t[7]);
+ for (var i = tmp_junc.length - 1; i >= 0; --i)
+ intron.push([pos + (glen - tmp_junc[i][1]), pos + (glen - tmp_junc[i][0])]);
+ }
+ } else {
+ while ((m = re_cigar.exec(cigar)) != null) {
+ var len = parseInt(m[1]), op = m[2];
+ if (op == 'N') {
+ intron.push([pos, pos + len]);
+ pos += len;
+ } else if (op == 'M' || op == 'X' || op == '=' || op == 'D') pos += len;
+ }
}
if (intron.length == 0) {
++n_sgl;
@@ -2509,6 +2554,276 @@ function paf_junceval(args)
}
}
+function paf_exoneval(args) // adapted from paf_junceval()
+{
+ var c, l_fuzzy = 0, print_ovlp = false, print_err_only = false, first_only = false, chr_only = false, aa = false, is_bed = false, use_cds = false, eval_base = false;
+ while ((c = getopt(args, "l:epcab1ds")) != null) {
+ if (c == 'l') l_fuzzy = parseInt(getopt.arg);
+ else if (c == 'e') print_err_only = print_ovlp = true;
+ else if (c == 'p') print_ovlp = true;
+ else if (c == 'c') chr_only = true;
+ else if (c == 'a') aa = true, use_cds = true;
+ else if (c == 'b') is_bed = true;
+ else if (c == '1') first_only = true;
+ else if (c == 'd') use_cds = true;
+ else if (c == 's') eval_base = true;
+ }
+
+ if (args.length - getopt.ind < 1) {
+ print("Usage: paftools.js exoneval [options] <gene.gtf> <aln.sam>");
+ print("Options:");
+ print(" -l INT tolerance of junction positions (0 for exact) [0]");
+ print(" -d evaluate coding regions only (exon regions by default)");
+ print(" -a miniprot PAF as input (force -d)");
+ print(" -p print overlapping exons");
+ print(" -e print erroreous overlapping exons");
+ print(" -c only consider alignments to /^(chr)?([0-9]+|X|Y)$/");
+ print(" -1 only process the first alignment of each query");
+ print(" -b BED as input");
+ print(" -s compute base Sn and Sp (more memory)");
+ exit(1);
+ }
+
+ var file, buf = new Bytes();
+
+ warn("Reading reference GTF...");
+ var tr = {};
+ file = args[getopt.ind] == '-'? new File() : new File(args[getopt.ind]);
+ while (file.readline(buf) >= 0) {
+ var m, t = buf.toString().split("\t");
+ if (t[0].charAt(0) == '#') continue;
+ if (use_cds) {
+ if (t[2] != "cds" && t[2] != "CDS") continue;
+ } else {
+ if (t[2] != 'exon') continue;
+ }
+ var st = parseInt(t[3]) - 1;
+ var en = parseInt(t[4]);
+ if ((m = /transcript_id "(\S+)"/.exec(t[8])) == null) continue;
+ var tid = m[1];
+ if (tr[tid] == null) tr[tid] = [t[0], t[6], 0, 0, []];
+ tr[tid][4].push([st, en]); // this keeps transcript
+ }
+ file.close();
+
+ var anno = {};
+ for (var tid in tr) { // traverse each transcript
+ var t = tr[tid];
+ Interval.sort(t[4]);
+ t[2] = t[4][0][0];
+ t[3] = t[4][t[4].length - 1][1];
+ if (anno[t[0]] == null) anno[t[0]] = [];
+ var s = t[4];
+ for (var i = 0; i < s.length; ++i) // traverse each exon
+ anno[t[0]].push([s[i][0], s[i][1]]);
+ }
+ tr = null;
+
+ for (var chr in anno) { // index exons
+ var e = anno[chr];
+ if (e.length == 0) continue;
+ Interval.sort(e);
+ var k = 0;
+ for (var i = 1; i < e.length; ++i) // dedup
+ if (e[i][0] != e[k][0] || e[i][1] != e[k][1])
+ e[++k] = e[i].slice(0);
+ e.length = k + 1;
+ Interval.index_end(e);
+ }
+
+ var n_pri = 0, n_unmapped = 0, n_mapped = 0;
+ var n_exon = 0, n_exon_hit = 0, n_exon_novel = 0;
+
+ file = getopt.ind+1 >= args.length || args[getopt.ind+1] == '-'? new File() : new File(args[getopt.ind+1]);
+ var last_qname = null, qexon = {};
+ var re_cigar = /(\d+)([MIDNSHP=XFGUV])/g;
+
+ warn("Evaluating alignments...");
+ while (file.readline(buf) >= 0) {
+ var m, t = buf.toString().split("\t");
+ var ctg_name = null, cigar = null, pos = null, qname;
+
+ if (t[0].charAt(0) == '@') continue;
+ if (t[0] == "##PAF") t.shift();
+ qname = t[0];
+ if (is_bed) {
+ ctg_name = t[0], pos = parseInt(t[1]), cigar == null;
+ } else if (t[4] == '+' || t[4] == '-' || t[4] == '*') { // PAF
+ ctg_name = t[5], pos = parseInt(t[7]);
+ var type = 'P';
+ for (i = 12; i < t.length; ++i) {
+ if ((m = /^(tp:A|cg:Z):(\S+)/.exec(t[i])) != null) {
+ if (m[1] == 'tp:A') type = m[2];
+ else cigar = m[2];
+ }
+ }
+ if (type == 'S') continue; // secondary
+ } else { // SAM
+ ctg_name = t[2], pos = parseInt(t[3]) - 1, cigar = t[5];
+ var flag = parseInt(t[1]);
+ if (flag&0x100) continue; // secondary
+ }
+
+ if (chr_only && !/^(chr)?([0-9]+|X|Y)$/.test(ctg_name)) continue;
+ if (first_only && last_qname == qname) continue;
+ if (ctg_name == '*') { // unmapped
+ ++n_unmapped;
+ continue;
+ } else {
+ ++n_pri;
+ if (last_qname != qname) {
+ ++n_mapped;
+ last_qname = qname;
+ }
+ }
+
+ var exon = [];
+ if (is_bed) { // BED
+ exon.push([pos, parseInt(t[2])]);
+ } else if (aa) {
+ var tmp_exon = [], tmp = 0, tmp_st = 0;
+ while ((m = re_cigar.exec(cigar)) != null) {
+ var len = parseInt(m[1]), op = m[2];
+ if (op == 'N') {
+ tmp_exon.push([tmp_st, tmp]);
+ tmp_st = tmp + len, tmp += len;
+ } else if (op == 'U') {
+ tmp_exon.push([tmp_st, tmp + 1]);
+ tmp_st = tmp + len - 2, tmp += len;
+ } else if (op == 'V') {
+ tmp_exon.push([tmp_st, tmp + 2]);
+ tmp_st = tmp + len - 1, tmp += len;
+ } else if (op == 'M' || op == 'X' || op == '=' || op == 'D') {
+ tmp += len * 3;
+ } else if (op == 'F' || op == 'G') {
+ tmp += len;
+ }
+ }
+ tmp_exon.push([tmp_st, tmp]);
+ if (t[4] == '+') {
+ for (var i = 0; i < tmp_exon.length; ++i)
+ exon.push([pos + tmp_exon[i][0], pos + tmp_exon[i][1]]);
+ } else if (t[4] == '-') { // For protein-to-genome alignment, the coordinates are on the query strand. Need to flip them.
+ var glen = parseInt(t[8]) - parseInt(t[7]);
+ for (var i = tmp_exon.length - 1; i >= 0; --i)
+ exon.push([pos + (glen - tmp_exon[i][1]), pos + (glen - tmp_exon[i][0])]);
+ }
+ } else {
+ var tmp_st = pos;
+ while ((m = re_cigar.exec(cigar)) != null) {
+ var len = parseInt(m[1]), op = m[2];
+ if (op == 'N') {
+ exon.push([tmp_st, pos]);
+ tmp_st = pos + len, pos += len;
+ } else if (op == 'M' || op == 'X' || op == '=' || op == 'D') pos += len;
+ }
+ exon.push([tmp_st, pos]);
+ }
+ n_exon += exon.length;
+
+ var chr = anno[ctg_name];
+ if (chr != null) {
+ for (var i = 0; i < exon.length; ++i) {
+ if (eval_base) {
+ if (qexon[ctg_name] == null) qexon[ctg_name] = [];
+ qexon[ctg_name].push([exon[i][0], exon[i][1]]);
+ }
+ var o = Interval.find_ovlp(chr, exon[i][0], exon[i][1]);
+ if (o.length > 0) {
+ var hit = false;
+ for (var j = 0; j < o.length; ++j) {
+ var st_diff = exon[i][0] - o[j][0];
+ var en_diff = exon[i][1] - o[j][1];
+ if (st_diff < 0) st_diff = -st_diff;
+ if (en_diff < 0) en_diff = -en_diff;
+ if (st_diff <= l_fuzzy && en_diff <= l_fuzzy)
+ ++n_exon_hit, hit = true;
+ if (hit) break;
+ }
+ if (print_ovlp) {
+ var type = hit? 'C' : 'P';
+ if (hit && print_err_only) continue;
+ var x = '[';
+ for (var j = 0; j < o.length; ++j) {
+ if (j) x += ', ';
+ x += '(' + o[j][0] + "," + o[j][1] + ')';
+ }
+ x += ']';
+ print(type, qname, i+1, ctg_name, exon[i][0], exon[i][1], x);
+ }
+ } else {
+ ++n_exon_novel;
+ if (print_ovlp)
+ print('N', qname, i+1, ctg_name, exon[i][0], exon[i][1]);
+ }
+ }
+ } else {
+ n_exon_novel += exon.length;
+ }
+ }
+ file.close();
+
+ buf.destroy();
+
+ if (!print_ovlp) {
+ print("# unmapped reads: " + n_unmapped);
+ print("# mapped reads: " + n_mapped);
+ print("# primary alignments: " + n_pri);
+ print("# predicted exons: " + n_exon);
+ print("# non-overlapping exons: " + n_exon_novel);
+ print("# correct exons: " + n_exon_hit + " (" + (n_exon_hit / n_exon * 100).toFixed(2) + "%)");
+ }
+
+ function merge_and_index(ex) {
+ for (var chr in ex) {
+ var a = [];
+ e = ex[chr];
+ Interval.sort(e);
+ var st = e[0][0], en = e[0][1];
+ for (var i = 1; i < e.length; ++i) { // merge
+ if (e[i][0] > en) {
+ a.push([st, en]);
+ st = e[i][0], en = e[i][1];
+ } else {
+ en = en > e[i][1]? en : e[i][1];
+ }
+ }
+ a.push([st, en]);
+ Interval.index_end(a);
+ ex[chr] = a;
+ }
+ }
+
+ function cal_sn(a0, a1) {
+ var tot = 0, cov = 0;
+ for (var chr in a1) {
+ var e0 = a0[chr], e1 = a1[chr];
+ for (var i = 0; i < e1.length; ++i)
+ tot += e1[i][1] - e1[i][0];
+ if (e0 == null) continue;
+ for (var i = 0; i < e1.length; ++i) {
+ var o = Interval.find_ovlp(e0, e1[i][0], e1[i][1]);
+ for (var j = 0; j < o.length; ++j) { // this only works when there are no overlaps between intervals
+ var st = e1[i][0] > o[j][0]? e1[i][0] : o[j][0];
+ var en = e1[i][1] < o[j][1]? e1[i][1] : o[j][1];
+ cov += en - st;
+ }
+ }
+ }
+ return [tot, cov];
+ }
+
+ if (eval_base) {
+ warn("Computing base Sn and Sp...");
+ merge_and_index(qexon);
+ merge_and_index(anno);
+ var sn = cal_sn(qexon, anno);
+ var sp = cal_sn(anno, qexon);
+ print("Base Sn: " + sn[1] + " / " + sn[0] + " = " + (sn[1] / sn[0] * 100).toFixed(2) + "%");
+ print("Base Sp: " + sp[1] + " / " + sp[0] + " = " + (sp[1] / sp[0] * 100).toFixed(2) + "%");
+ }
+}
+
// evaluate overlap sensitivity
function paf_ov_eval(args)
{
@@ -2704,6 +3019,23 @@ function paf_misjoin(args)
return len < (en - st) * cen_ratio? false : true;
}
+ function test_cen_point(cen, chr, x) {
+ var b = cen[chr];
+ if (b == null) return false;
+ for (var j = 0; j < b.length; ++j)
+ if (x >= b[j][0] && x < b[j][1])
+ return true;
+ return false;
+ }
+
+ if (show_err || show_long) {
+ print("C\tJ inter-chromosomal misjoin");
+ print("C\tj inter-chromosomal misjoin with both breakpoints ending in centromeres");
+ print("C\tG long gap on the reference genome");
+ print("C\tg long gap on the reference genome with both breakpoints ending in centromeres");
+ print("C\tM closed inversion");
+ print("C");
+ }
function process(a) {
var k = 0;
for (var i = 0; i < a.length; ++i) {
@@ -2716,14 +3048,17 @@ function paf_misjoin(args)
a = a.sort(function(x,y){return x[2]-y[2]});
if (show_long) for (var i = 0; i < a.length; ++i) print(a[i].join("\t"));
for (var i = 1; i < a.length; ++i) {
- var ov = [false, false];
+ var ov = [false, false], end_cen = [false, false];
ov[0] = test_cen(cen, a[i-1][5], a[i-1][7], a[i-1][8]);
ov[1] = test_cen(cen, a[i][5], a[i][7], a[i][8]);
+ end_cen[0] = test_cen_point(cen, a[i-1][5], a[i-1][4] == '+'? a[i-1][8] : a[i-1][7]);
+ end_cen[1] = test_cen_point(cen, a[i][5], a[i][4] == '+'? a[i][7] : a[i][8]);
if (a[i-1][5] != a[i][5]) { // different chr
if (ov[0] || ov[1]) ++n_diff[1];
else if (show_err) {
- print("J", a[i-1].slice(0, 12).join("\t"));
- print("J", a[i].slice(0, 12).join("\t"));
+ var label = end_cen[0] && end_cen[1]? 'j' : 'J';
+ print(label, a[i-1].slice(0, 12).join("\t"));
+ print(label, a[i].slice(0, 12).join("\t"));
}
++n_diff[0];
} else if (a[i-1][4] == a[i][4]) { // a gap
@@ -2733,8 +3068,9 @@ function paf_misjoin(args)
if (gap > max_gap) {
if (ov[0] || ov[1]) ++n_gap[1];
else if (show_err) {
- print("G", a[i-1].slice(0, 12).join("\t"));
- print("G", a[i].slice(0, 12).join("\t"));
+ var label = end_cen[0] && end_cen[1]? 'g' : 'G';
+ print(label, a[i-1].slice(0, 12).join("\t"));
+ print(label, a[i].slice(0, 12).join("\t"));
}
++n_gap[0];
}
@@ -3084,6 +3420,183 @@ function paf_pafcmp(args)
buf.destroy();
}
+function paf_longcs2seq(args) {
+ var c, opt = { query:false };
+ while ((c = getopt(args, "q")) != null)
+ if (c == 'q') opt.query = true;
+ if (args.length == getopt.ind) {
+ print("Usage: paftools.js longcs2seq [-q] <long-cs.paf>");
+ return;
+ }
+ var re_cs = /([:=*+-])(\d+|[A-Za-z]+)/g
+ var buf = new Bytes();
+ var file = args[getopt.ind] == "-"? new File() : new File(args[getopt.ind]);
+ while (file.readline(buf) >= 0) {
+ var m, cs = null, t = buf.toString().split("\t");
+ for (var i = 12; i < t.length; ++i)
+ if ((m = /^cs:Z:(\S+)/.exec(t[i])) != null) {
+ cs = m[1];
+ break;
+ }
+ if (cs == null) continue;
+ var ts = "", qs = "";
+ while ((m = re_cs.exec(cs)) != null) {
+ if (m[1] == "=") ts += m[2], qs += m[2];
+ else if (m[1] == "+") qs += m[2].toUpperCase();
+ else if (m[1] == "-") ts += m[2].toUpperCase();
+ else if (m[1] == "*") ts += m[2][0].toUpperCase(), qs += m[2][1].toUpperCase();
+ else if (m[1] == ":") throw Error("Long cs is required");
+ }
+ if (opt.query) {
+ print(">" + t[0] + "_" + t[2] + "_" + t[3]);
+ print(qs);
+ } else {
+ print(">" + t[5] + "_" + t[7] + "_" + t[8]);
+ print(ts);
+ }
+ }
+ file.close();
+ buf.destroy();
+}
+
+function paf_paf2gff(args) {
+ var c, opt = { aa:false };
+ var re_cigar = /(\d+)([A-Z=])/g;
+ while ((c = getopt(args, "a")) != null) {
+ if (c == 'a') opt.aa = true;
+ }
+ if (args.length == getopt.ind) {
+ print("Usage: paftools.js paf2gff [-a] <in.paf>");
+ return;
+ }
+ var buf = new Bytes();
+ var file = args[getopt.ind] == '-'? new File() : new File(args[getopt.ind]);
+ var hid = 1, last_name = null;
+ while (file.readline(buf) >= 0) {
+ var m, t = buf.toString().split("\t");
+ if (t[5] == '*') continue; // skip unmapped lines
+
+ if (t[0] != last_name) last_name = t[0], hid = 1;
+ else ++hid;
+ for (var i = 1; i <= 3; ++i) t[i] = parseInt(t[i]);
+ for (var i = 6; i <= 11; ++i) t[i] = parseInt(t[i]);
+ var cigar = null, score = null, np = null, dist_stop = null, dist_start = null;
+ for (var i = 12; i < t.length; ++i) {
+ if ((m = /^(cg:Z|AS:i|np:i|da:i|do:i):(\S+)/.exec(t[i])) != null) {
+ if (m[1] == 'cg:Z') cigar = m[2];
+ else if (m[1] == 'AS:i') score = parseInt(m[2]);
+ else if (m[1] == 'np:i') np = parseInt(m[2]);
+ else if (m[1] == 'do:i') dist_stop = parseInt(m[2]);
+ else if (m[1] == 'da:i') dist_start = parseInt(m[2]);
+ }
+ }
+ if (cigar == null) throw Error("failed to find the cg:Z tag");
+ if (score == null) throw Error("failed to find the AS:i tag");
+
+ var st = 0, en = 0, phase = 0, pseudo = false, fs = 0, a = [];
+ if (dist_start != null && dist_start == 0)
+ a.push([t[5], 'paf2gff', 'start_codon', 0, 3, 0, t[4], '.', 0]);
+ while ((m = re_cigar.exec(cigar)) != null) {
+ var len = parseInt(m[1]);
+ if (m[2] == 'M' || m[2] == 'D') {
+ en += opt.aa? len * 3 : len;
+ } else if (m[2] == 'F' || m[2] == 'G' || m[2] == 'R') {
+ en += len, pseudo = true, fs = 1;
+ } else if (m[2] == 'N') {
+ a.push([t[5], 'paf2gff', 'exon', st, en, 0, t[4], phase, fs]);
+ st = en + len, en += len, phase = 0, fs = 0;
+ } else if (m[2] == 'U') { // ...xGT...AGxx...
+ a.push([t[5], 'paf2gff', 'exon', st, en + 1, 0, t[4], phase, fs]);
+ st = en + len - 2, en += len, phase = 2, fs = 0;
+ } else if (m[2] == 'V') { // ...xxGT...AGx...
+ a.push([t[5], 'paf2gff', 'exon', st, en + 2, 0, t[4], phase, fs]);
+ st = en + len - 1, en += len, phase = 1, fs = 0;
+ }
+ }
+ a.push([t[5], 'paf2gff', 'exon', st, en, 0, t[4], phase, fs]);
+ if (en != t[8] - t[7]) throw Error("inconsistent cigar");
+ if (dist_stop != null && dist_stop == 0)
+ a.push([t[5], 'paf2gff', 'stop_codon', en, en + 3, 0, t[4], '.', 0]);
+ var type = pseudo? 'pseudogene' : 'protein_coding';
+ var attr = ['transcript_id=' + t[0] + '#' + hid, 'transcript_type=' + type].join(";");
+ var trans_attr = 'identity=' + (t[9] / t[10]).toFixed(4);
+ if (np != null) trans_attr += ';positive=' + (np * 3 / t[10]).toFixed(4);
+ trans_attr += ';aa_start=' + t[2];
+ trans_attr += ';aa_end=' + (t[1] - t[3]);
+ if (dist_start != null && dist_start >= 0) trans_attr += ';dist_start_codon=' + dist_start;
+ if (dist_stop != null && dist_stop >= 0) trans_attr += ';dist_stop_codon=' + dist_stop;
+ var trans_st = t[7], trans_en = t[8];
+ if (dist_stop != null && dist_stop == 0) {
+ if (t[4] == '-') trans_st -= 3;
+ else trans_en += 3;
+ }
+ print([t[5], 'paf2gff', 'transcript', trans_st + 1, trans_en, score, t[4], '.', attr + ';' + trans_attr].join("\t"));
+ if (opt.aa && t[4] == '-') {
+ var b = [], len = t[8] - t[7];
+ for (var i = a.length - 1; i >= 0; --i) {
+ var x = len - a[i][3];
+ a[i][3] = len - a[i][4];
+ a[i][4] = x;
+ //a[i][7] = a[i][7] == 0? 0 : 3 - a[i][7]; // not sure if this line is needed
+ b.push(a[i]);
+ }
+ a = b;
+ }
+ for (var i = 0; i < a.length; ++i) {
+ if (!pseudo && a[i][2] == "exon") a[i][2] = "CDS";
+ a[i][3] += t[7] + 1;
+ a[i][4] += t[7];
+ a[i][8] = attr + ";frameshift=" + a[i][8];
+ print(a[i].join("\t"));
+ }
+ }
+ file.close();
+ buf.destroy();
+}
+
+function paf_gff2junc(args) {
+ var c, feat = "CDS";
+ while ((c = getopt(args, "f:")) != null) {
+ if (c == 'f') feat = getopt.arg;
+ }
+ if (getopt.ind == args.length) {
+ print("Usage: paftools.js gff2junc [-f feature] <in.gff3>");
+ return;
+ }
+ var buf = new Bytes();
+ var file = args[getopt.ind] == "-"? new File() : new File(args[getopt.ind]);
+
+ function process_a(a) {
+ if (a.length < 2) return;
+ a = a.sort(function(x, y) { return x[4] - y[4] });
+ for (var i = 1; i < a.length; ++i)
+ print([a[i][1], a[i-1][5], a[i][4], a[i][0], 0, a[i][7]].join("\t"));
+ }
+
+ var a = [];
+ while (file.readline(buf) >= 0) {
+ var m, t = buf.toString().split("\t");
+ if (t[0][0] == '#') continue;
+ if (t[2].toLowerCase() != feat.toLowerCase()) continue;
+ //print(t.join("\t"));
+ if ((m = /\bParent=([^;]+)/.exec(t[8])) == null) {
+ warn("Can't find Parent");
+ continue;
+ }
+ t[3] = parseInt(t[3]) - 1;
+ t[4] = parseInt(t[4]);
+ t.unshift(m[1]);
+ if (a.length > 0 && a[0][0] != m[1]) {
+ process_a(a);
+ a.length = 0;
+ a.push(t);
+ } else a.push(t);
+ }
+ process_a(a);
+ file.close();
+ buf.destroy();
+}
+
/*************************
***** main function *****
*************************/
@@ -3098,6 +3611,9 @@ function main(args)
print(" sam2paf convert SAM to PAF");
print(" delta2paf convert MUMmer's delta to PAF");
print(" gff2bed convert GTF/GFF3 to BED12");
+ print(" gff2junc convert GFF3 to junction BED");
+ print(" longcs2seq convert long-cs PAF to sequences");
+// print(" paf2gff convert PAF to GFF3 (tested for miniprot only)");
print("");
print(" stat collect basic mapping information in PAF/SAM");
print(" asmstat collect basic assembly information");
@@ -3115,6 +3631,7 @@ function main(args)
print(" mason2fq convert mason2-simulated SAM to FASTQ");
print(" pbsim2fq convert PBSIM-simulated MAF to FASTQ");
print(" junceval evaluate splice junction consistency with known annotations");
+ print(" exoneval evaluate exon-level consistency with known annotations");
print(" ov-eval evaluate read overlap sensitivity using read-to-ref mapping");
exit(1);
}
@@ -3125,6 +3642,7 @@ function main(args)
else if (cmd == 'delta2paf') paf_delta2paf(args);
else if (cmd == 'splice2bed') paf_splice2bed(args);
else if (cmd == 'gff2bed') paf_gff2bed(args);
+ else if (cmd == 'gff2junc') paf_gff2junc(args);
else if (cmd == 'stat') paf_stat(args);
else if (cmd == 'asmstat') paf_asmstat(args);
else if (cmd == 'asmgene') paf_asmgene(args);
@@ -3138,10 +3656,13 @@ function main(args)
else if (cmd == 'mason2fq') paf_mason2fq(args);
else if (cmd == 'pbsim2fq') paf_pbsim2fq(args);
else if (cmd == 'junceval') paf_junceval(args);
+ else if (cmd == 'exoneval') paf_exoneval(args);
else if (cmd == 'ov-eval') paf_ov_eval(args);
else if (cmd == 'vcfstat') paf_vcfstat(args);
else if (cmd == 'sveval') paf_sveval(args);
else if (cmd == 'vcfsel') paf_vcfsel(args);
+ else if (cmd == 'longcs2seq') paf_longcs2seq(args);
+ else if (cmd == 'paf2gff') paf_paf2gff(args);
else if (cmd == 'version') print(paftools_version);
else throw Error("unrecognized command: " + cmd);
}
=====================================
options.c
=====================================
@@ -8,7 +8,7 @@ void mm_idxopt_init(mm_idxopt_t *opt)
opt->k = 15, opt->w = 10, opt->flag = 0;
opt->bucket_bits = 14;
opt->mini_batch_size = 50000000;
- opt->batch_size = 4000000000ULL;
+ opt->batch_size = 8000000000ULL;
}
void mm_mapopt_init(mm_mapopt_t *opt)
=====================================
pyproject.toml
=====================================
@@ -0,0 +1,2 @@
+[build-system]
+requires = ["setuptools", "wheel", "Cython"]
=====================================
python/mappy.pyx
=====================================
@@ -3,7 +3,7 @@ from libc.stdlib cimport free
cimport cmappy
import sys
-__version__ = '2.24'
+__version__ = '2.26'
cmappy.mm_reset_timer()
@@ -172,6 +172,7 @@ cdef class Aligner:
cdef cmappy.mm_mapopt_t map_opt
if self._idx == NULL: return
+ if ((self.map_opt.flag & 4) and (self._idx.flag & 2)): return
map_opt = self.map_opt
if max_frag_len is not None: map_opt.max_frag_len = max_frag_len
if extra_flags is not None: map_opt.flag |= extra_flags
@@ -217,6 +218,7 @@ cdef class Aligner:
cdef int l
cdef char *s
if self._idx == NULL: return
+ if ((self.map_opt.flag & 4) and (self._idx.flag & 2)): return
s = cmappy.mappy_fetch_seq(self._idx, name.encode(), start, end, &l)
if l == 0: return None
r = s[:l] if isinstance(s, str) else s[:l].decode()
=====================================
seed.c
=====================================
@@ -7,7 +7,7 @@ void mm_seed_mz_flt(void *km, mm128_v *mv, int32_t q_occ_max, float q_occ_frac)
mm128_t *a;
size_t i, j, st;
if (mv->n <= q_occ_max || q_occ_frac <= 0.0f || q_occ_max <= 0) return;
- KMALLOC(km, a, mv->n);
+ a = Kmalloc(km, mm128_t, mv->n);
for (i = 0; i < mv->n; ++i)
a[i].x = mv->a[i].x, a[i].y = i;
radix_sort_128x(a, a + mv->n);
=====================================
setup.py
=====================================
@@ -23,7 +23,7 @@ def readme():
setup(
name = 'mappy',
- version = '2.24',
+ version = '2.26',
url = 'https://github.com/lh3/minimap2',
description = 'Minimap2 python binding',
long_description = readme(),
View it on GitLab: https://salsa.debian.org/med-team/minimap2/-/commit/1f95290ff9d3d0936daa700dbfb6a77d56c7333c
--
View it on GitLab: https://salsa.debian.org/med-team/minimap2/-/commit/1f95290ff9d3d0936daa700dbfb6a77d56c7333c
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20240229/e9a9a563/attachment-0001.htm>
More information about the debian-med-commit
mailing list