[med-svn] [Git][med-team/staden-io-lib][upstream] New upstream version 1.15.0
Étienne Mollier (@emollier)
gitlab at salsa.debian.org
Sat Jan 20 19:35:51 GMT 2024
Étienne Mollier pushed to branch upstream at Debian Med / staden-io-lib
Commits:
7bb8453e by Étienne Mollier at 2024-01-20T18:48:25+01:00
New upstream version 1.15.0
- - - - -
9 changed files:
- CHANGES
- README.md
- configure.ac
- io_lib/cram_encode.c
- progs/scramble.c
- tests/data/xx#MD.sam
- tests/data/xx#MD2.sam
- tests/data/xx#rg.sam
- tests/data/xx.fa
Changes:
=====================================
CHANGES
=====================================
@@ -1,3 +1,21 @@
+Version 1.15.0 (14th April 2023)
+--------------
+
+Version number bumped to reflect the official status of CRAM 3.1.
+
+Updates:
+
+* Formally accept CRAM 3.1 as an official standard. Warning removed.
+ For best compatibility CRAM 3.0 is still the default CRAM, but use
+ "-V3.1" to specify the version.
+
+* Updated to latest htscodecs. This has a significant speed
+ improvement in encoding with fqzcomp (enabled in "-X small" profile).
+
+ Tested on a NovaSeq dataset, encoding from BAM to CRAM was 27% faster.
+ Decoding a CRAM with fqzcomp is also around 6% faster.
+
+
Version 1.14.15 (6th December 2022)
---------------
=====================================
README.md
=====================================
@@ -1,5 +1,5 @@
-Io_lib: Version 1.14.15
-========================
+Io_lib: Version 1.15.0
+=======================
Io_lib is a library of file reading and writing code to provide a general
purpose SAM/BAM/CRAM, trace file (and Experiment File) reading
@@ -33,131 +33,30 @@ See the CHANGES for a summary of older updates or git logs for the
full details.
-Version 1.14.15 (6th December 2022)
----------------
+Version 1.15.0 (14th April 2023)
+--------------
-This is primarily a bug fix release.
+The first release that no longer warns about CRAM 3.1 being draft.
+No changes have been made to the format and it is fully compatible
+with the 1.14.x releases.
-Version 1.14.14 (17th March 2021)
----------------
+Technology Demo: 4.0
+====================
-This is simply a bug fix release. It also updates to the latest
-htscodecs submodule, now at an official 1.0 release.
+The current official GA4GH CRAM version is 3.1.
-Version 1.14.13 (3rd July 2020)
----------------
+The current default CRAM output is 3.0, for maximum compatibility with
+other tools. Use the -V3.1 option to select CRAM 3.1 if needed.
-This release has a mixture of on-going CRAM 4 work (not compatible
-with previous CRAM 4) and some more general quality of life
-improvements for all CRAM versions including speed-ups and better
-multi-threading.
-
-Note both CRAM 3.1 and 4.0 are still to be considered an unofficial
-CRAM extensions.
-
-Updates:
-
-* Scramble can now filter-in or filter-out aux tags during
- transcoding. This is done using -d and -D options. For example:
-
- scramble -D OQ,BI,BD in.bam out.cram
-
- removes the GATK added OQ, BI and BD aux tags.
- Requested by @jhaezebrouck in issue #24.
-
-* The Scramble -X <profile> options are now implemented using a
- CRAM_OPT_PROFILE option. This simplifies the scramble code and
- makes it easier to call from a library. This also fixes a number of
- bugs in the order of argument parsing.
-
-* Improved CRAM writing speeds.
-
- The bam_copy function now only copies the number of used bytes
- rather than the number of allocated bytes, which can sometimes be
- substantially smaller. As this was done in the main thread it may
- have a significant benefit when multi-threading.
-
-* Added libdeflate support into CRAM too (in addition to the existing
- support in BAM). This isn't a huge change to CRAM speeds except at
- high levels (-8 and -9) which are now slower, but also better
- compression ratio. A modest 2-3% speed gain is visible are low and
- mid levels, and at -1/-2 to -4 the compression ratio is also
- improved.
-
-* CRAM 3.1 compression level -1 is now 25% faster, but 4% larger.
- This is achieved by difference choice of compression codecs, most
- notably disabling the name tokeniser for level 1. Use level 2 for
- something comparable to the old behaviour.
-
-* Added an io_lib/version.h to make it easier to detect the version
- being compiled against using IOLIB_VERSION macros.
- Requested by German Tischler in issue #25.
-
-* Refactored the cram encoding interface used by biobambam.
- Implemented by German Tischler in PR#27.
-
-* CRAM 4 now uses E_CONST instead of a uni-value version of
- E_HUFFMAN. Also added offset field to VARINT_SIGNED and
- VARINT_UNSIGNED which helps for data series that have values from -1
- to MAXINT.
-
-* CRAM 4 container structure has changed so that all values are
- variable sized integers instead of fixed size.
-
-* Further improvements with CRAM 4's use of signed values.
- - Ref_seq_id is container and slice headers are now signed.
- - RI (ref ID) data series and NS (mate ref ID) are also now signed
- as -1 is a valid value.
- - Embedded ref id is now 0 for unusued instead of -1.
-
-* Reversed the use of CRAM 4 delta encoding for the B array. It only
- helps at the moment for ONT signal data, so it needs more work to
- make it auto-detect when delta makes sense. (Enabling it globally
- for CRAM4 B aux tags was accidental.)
-
-* Htscodecs submodule has gained support for big-endian platforms
- Other big-endian improvements to parts of CRAM4 too.
-
-Bug fixes:
-
-* Fixed CRAM MD tag generatin when using the "b" feature code
- (NB: unused by known CRAM encoders).
- Also see https://github.com/samtools/htslib/pull/1086 for more details.
-
-* Fixed CRAM quality string when using "q" feature code (unused by
- encoders?) and in lossy-quality mode (maybe utilised in old
- Cramtools).
- Also see https://github.com/samtools/htslib/pull/1094 for more details.
-
-* Fixed some minor memory leaks.
-
-* "Scramble -X archive -1" enabled lzma, which should only have
- arrived at level 7 and above. (It compared integer 7 vs ASCII '1'.)
-
-* Removed minor compilation warning in printf debugging.
-
-* Fixed a 7 year old bug in scram_pileup which couldn't cope with
- soft-clips being followed by hard-clips.
-
-
-Technology Demo: CRAM 3.1 and 4.0
-=================================
-
-The current official GA4GH CRAM version is 3.0.
-
-For purposes of *EVALUATION ONLY* this release of io_lib includes CRAM
-version 3.1, with new compression codecs (but is otherwise identical
-file layout to 3.0), and 4.0 with a few additional format
+For purposes of *EVALUATION ONLY* this release of io_lib also includes
+an experimental CRAM version 4.0. The format very likely to change
+and should not be used for production data. CRAM 4.0 includes format
modifications, such as 64-bit sizes, deduplication of read names,
orientation changes of quality strings and a revised variable sized
-integer encoding.
+integer encoding. It can be enabled using scramble -V4.0
-They can be turned on using e.g. scramble -V3.1 or scramble -V4.0.
-It is likely CRAM v4.0 will be official significantly later, but we
-plan on v3.1 being a recognised GA4GH standard this year.
-
-By default enabling either of these will also enable the new codecs.
+Enabling CRAM 3.1 or 4.0 will also enable the new codecs.
Which codecs are used also depends on the profile specified (eg via
"-X small"). Some of the new codecs are considerably slower,
especially at decompression, but by default CRAM 3.1 aims to be
@@ -167,79 +66,37 @@ small and archive respectively).
Here are some example file sizes and timings with different codecs and
levels on 10 million 150bp NovaSeq reads, single threaded. Decode
-timing is checked using "scram_flagstat -b". Tests were performed
-on an Intel i5-4570 processor at 3.2GHz.
+timing is checked using "scram_flagstat -b".
+
+Table produced with Io_lib 1.15.0 on a laptop with Intel i7-1185G7
+CPU running Ubuntu 20.04 under Microsoft's WSL2.
|Scramble opts. |Size(MB) |Enc(s)|Dec(s)|Codecs used |
|--------------------|--------:|-----:|-----:|---------------------------|
-|-O bam | 531.9| 92.3| 7.5|bgzf(zlib) |
-|-O bam -1 | 611.4| 26.4| 5.4|bgzf(libdeflate) |
-|-O bam (default) | 539.5| 45.0| 4.9|bgzf(libdeflate) |
-|-O bam -9 | 499.5| 920.2| 4.9|bgzf(libdeflate) |
-||||||
-|-V2.0 -X fast | 317.7| 38.8| 11.8|(default, level 1) |
-|-V2.0 (default) | 267.6| 47.0| 10.5|(default) |
-|-V2.0 -X small | 218.0| 124.6| 33.1|bzip2 |
-||||||
-|-V3.0 -X fast | 264.9| 31.3| 10.8|(default, level 1) |
-|-V3.0 (default) | 223.7| 34.7| 10.3|(default) |
-|-V3.0 -X small | 212.3| 88.3| 18.2|bzip2 |
-|-V3.0 -X archive | 209.4| 98.7| 18.2|bzip2 |
-||||||
-|-V3.1 -X fast | 262.4| 29.1| 9.3|rANS++ |
-|-V3.1 (default) | 186.4| 33.7| 8.3|rANS++,tok3 |
-|-V3.1 -X small | 176.8| 74.0| 35.2|rANS++,tok3,fqz |
-|-V3.1 -X archive | 171.9| 127.9| 34.9|rANS++,tok3,fqz,bzip2,arith|
-||||||
-|-V4.0 -X fast | 251.2| 28.9| 9.6|rANS++ |
-|-V4.0 (default) | 182.1| 32.9| 8.2|rANS++,tok3 |
-|-V4.0 -X small | 170.9| 70.9| 35.0|rANS++,tok3,fqz |
-|-V4.0 -X archive | 166.9| 116.4| 34.2|rANS++,tok3,fqz,bzip2,arith|
-
-We also tested on a small human aligned HiSeq run (ERR317482)
-representing older Illumina data with pre-binning era quality values.
-This dataset shows less impressive gains with 4.0 over 3.0 in the
-default profile, but major gains in small profile once fqzcomp quality
-encoding is enabled.
-
-Note for this file, the file sizes are larger meaning less disk
-caching is possible (the test machine wasn't a memory stressed
-desktop). Threading was also enabled, albeit with just 4 threads,
-which further exacerbates I/O bottlenecks. The previous test
-demonstrated BAM being faster to read than CRAM, but with large files
-in a more I/O stressed situation this test demonstrates the default
-profile of CRAM is faster to read than BAM, due to the smaller I/O
-footprint.
-
-NB: the table below was produced with 1.14.12.
-
-|Scramble opts. |Size(MB) |Enc(s)|Dec(s)|Codecs used |
-|-------------------- |--------:|-----:|-----:|--------------------------------|
-|-t4 -O bam (default) | 6526 | 115.4| 44.7|bgzf(libdeflate) |
+|-O bam (default) | 518.2| 65.8| 5.7|bgzf(zlib) |
+|-O bam -1 | 584.5| 17.4| 3.5|bgzf(libdeflate) |
+|-O bam (default) | 524.6| 27.8| 2.9|bgzf(libdeflate) |
+|-O bam -9 | 486.5| 810.4| 3.0|bgzf(libdeflate) |
||||||
-|-t4 -V2.0 -X fast | 3674 | 87.4| 31.4|(default, level 1) |
-|-t4 -V2.0 (default) | 3435 | 91.4| 30.7|(default) |
-|-t4 -V2.0 -X small | 3373 | 145.5| 47.8|bzip2 |
-|-t4 -V2.0 -X archive | 3377 | 166.3| 49.7|bzip2 |
-|-t4 -V2.0 -X archive -9| 3125 |1900.6| 76.9|bzip2 |
+|-V2.0 -X fast | 294.5| 23.1| 7.8|(default, level 1) |
+|-V2.0 (default) | 252.3| 32.9| 8.0|(default) |
+|-V2.0 -X small | 208.0| 85.2| 23.5|bzip2 |
+|-V2.0 -X archive | 206.0| 88.1| 24.3|bzip2 |
||||||
-|-t4 -V3.0 -X fast | 3620 | 88.3| 29.3|(default, level 1) |
-|-t4 -V3.0 (default) | 3287 | 90.5| 29.5|(default) |
-|-t4 -V3.0 -X small | 3238 | 128.5| 40.3|bzip2 |
-|-t4 -V3.0 -X archive | 3220 | 164.9| 50.0|bzip2 |
-|-t4 -V3.0 -X archive -9| 3115 |1866.6| 75.2|bzip2, lzma |
+|-V3.0 -X fast | 241.1| 19.7| 8.5|(default, level 1) |
+|-V3.0 (default) | 208.5| 23.0| 8.8|(default) |
+|-V3.0 -X small | 201.7| 60.0| 14.5|bzip2 |
+|-V3.0 -X archive | 199.9| 61.7| 13.6|bzip2 |
||||||
-|-t4 -V3.1 -X fast | 3611 | 87.9| 29.2|rANS++ |
-|-t4 -V3.1 (default) | 3161 | 88.8| 29.7|rANS++,tok3 |
-|-t4 -V3.1 -X small | 2249 | 192.2| 146.1|rANS++,tok3,fqz |
-|-t4 -V3.1 -X archive | 2157 | 235.2| 127.5|rANS++,tok3,fqz,bzip2,arith |
-|-t4 -V3.1 -X archive | 2145 | 480.3| 128.9|rANS++,tok3,fqz,bzip2,arith,lzma|
+|-V3.1 -X fast | 237.1| 22.1| 7.9|rANS++ |
+|-V3.1 (default) | 175.8| 26.7| 8.9|rANS++,tok3 |
+|-V3.1 -X small | 166.9| 47.9| 24.6|rANS++,tok3,fqz |
+|-V3.1 -X archive | 162.2| 72.5| 20.5|rANS++,tok3,fqz,bzip2,arith|
||||||
-|-t4 -V4.0 -X fast | 3551 | 87.8| 29.5|rANS++ |
-|-t4 -V4.0 (default) | 3148 | 88.9| 30.0|rANS++,tok3 |
-|-t4 -V4.0 -X small | 2236 | 189.7| 142.6|rANS++,tok3,fqz |
-|-t4 -V4.0 -X archive | 2139 | 226.7| 127.5|rANS++,tok3,fqz,bzip2,arith |
-|-t4 -V4.0 -X archive -9| 2132 | 453.5| 128.2|rANS++,tok3,fqz,bzip2,arith,lzma|
+|-V4.0 -X fast | 227.5| 16.6| 6.2|rANS++ |
+|-V4.0 (default) | 172.8| 19.7| 6.3|rANS++,tok3 |
+|-V4.0 -X small | 162.3| 34.8| 20.2|rANS++,tok3,fqz |
+|-V4.0 -X archive | 157.9| 82.2| 26.2|rANS++,tok3,fqz,bzip2,arith|
Building
=====================================
configure.ac
=====================================
@@ -1,5 +1,5 @@
dnl Process this file with autoconf to produce a configure script.
-AC_INIT(io_lib, 1.14.15)
+AC_INIT(io_lib, 1.15.0)
IOLIB_VERSION=$PACKAGE_VERSION
IOLIB_VERSION_MAJOR=`expr "$PACKAGE_VERSION" : '\([[0-9]]*\)'`
IOLIB_VERSION_MINOR=`expr "$PACKAGE_VERSION" : '[[0-9]]*\.\([[0-9]]*\)'`
@@ -69,7 +69,7 @@ AX_SUBDIRS_CONFIGURE([htscodecs],[[--disable-shared],[--with-pic]])
# libstaden-read.so.1.1.0
VERS_CURRENT=15
-VERS_REVISION=2
+VERS_REVISION=3
VERS_AGE=1
AC_SUBST(VERS_CURRENT)
AC_SUBST(VERS_REVISION)
=====================================
io_lib/cram_encode.c
=====================================
@@ -535,6 +535,7 @@ static int cram_encode_slice_read(cram_fd *fd,
int32_t i32;
int64_t i64;
unsigned char uc;
+ int explicit_qual = 0;
//fprintf(stderr, "Encode seq %d, %d/%d FN=%d, %s\n", rec, core->byte, core->bit, cr->nfeature, s->name_ds->str + cr->name);
@@ -609,11 +610,6 @@ static int cram_encode_slice_read(cram_fd *fd,
/* Aux tags */
r |= h->codecs[DS_TL]->encode(s, h->codecs[DS_TL], (char *)&cr->TL, 1);
- // qual
- r |= h->codecs[DS_QS]->encode(s, h->codecs[DS_QS],
- (char *)BLOCK_DATA(s->qual_blk) + cr->qual,
- cr->len);
-
// features (diffs)
if (!(cr->flags & BAM_FUNMAP)) {
int prev_pos = 0, j;
@@ -686,6 +682,7 @@ static int cram_encode_slice_read(cram_fd *fd,
uc = f->B.qual;
r |= h->codecs[DS_QS]->encode(s, h->codecs[DS_QS],
(char *)&uc, 1);
+ explicit_qual++;
break;
case 'b':
@@ -700,6 +697,7 @@ static int cram_encode_slice_read(cram_fd *fd,
uc = f->Q.qual;
r |= h->codecs[DS_QS]->encode(s, h->codecs[DS_QS],
(char *)&uc, 1);
+ explicit_qual++;
break;
case 'N':
@@ -736,6 +734,11 @@ static int cram_encode_slice_read(cram_fd *fd,
r |= h->codecs[DS_BA]->encode(s, h->codecs[DS_BA], seq, cr->len);
}
+ // qual
+ r |= h->codecs[DS_QS]->encode(s, h->codecs[DS_QS],
+ (char *)BLOCK_DATA(s->qual_blk) + cr->qual
+ + explicit_qual, cr->len);
+
return r ? -1 : 0;
}
=====================================
progs/scramble.c
=====================================
@@ -184,7 +184,7 @@ static int filter_tags(bam_seq_t *s, char *aux_filter, int keep) {
static void usage(FILE *fp) {
fprintf(fp, " -=- sCRAMble -=- version %s\n", IOLIB_VERSION);
- fprintf(fp, "Author: James Bonfield, Wellcome Trust Sanger Institute. 2013-2022\n\n");
+ fprintf(fp, "Author: James Bonfield, Wellcome Trust Sanger Institute. 2013-2023\n\n");
fprintf(fp, "Usage: scramble [options] [input_file [output_file]]\n");
@@ -504,10 +504,6 @@ int main(int argc, char **argv) {
fprintf(stderr, "\nWARNING: this version of CRAM is not a recognised GA4GH standard.\n"
"Note this CRAM version is a technology demonstration only.\n"
"Future versions of Scramble may not be able to read these files.\n\n");
- } else if (cram_default_version() > 300) {
- fprintf(stderr, "\nWARNING: this version of CRAM has yet to be formally signed off.\n"
- "CRAM 3.1 has multiple implementations that have been cross-validated, but\n"
- "the specification document has not yet been accepted as an official standard.\n\n");
}
if (argc - optind > 2) {
=====================================
tests/data/xx#MD.sam
=====================================
@@ -1,7 +1,7 @@
@SQ SN:xx LN:30
@CO All MD and NM should match the stored values
a 0 xx 6 1 10M * 0 0 AAAAATTTTT * co:Z:no fields
-a 0 xx 6 1 10M * 0 0 AAAAGGTTTT *
+a 0 xx 6 1 11M * 0 0 AAAAGRTTTTT ABCDEFGHIJK
a 0 xx 6 1 10M * 0 0 GAAAATTTTG *
i 0 xx 6 1 5M1I5M * 0 0 AAAAAGTTTTT *
i 0 xx 6 1 5M3I5M * 0 0 AAAAAGGGTTTTT *
@@ -11,12 +11,12 @@ d 0 xx 6 1 5M10D5M * 0 0 AAAAACCCCC *
d 0 xx 6 1 5M10N5M * 0 0 AAAAACCCCC *
sid 0 xx 6 1 1S4M10D5I4M1S * 0 0 AAAAAGGGGGCCCCC *
a 0 xx 6 1 10M * 0 0 AAAAATTTTT * MD:Z:10 NM:i:0 co:Z:correct fields
-a 0 xx 6 1 10M * 0 0 AAAAGGTTTT * MD:Z:4A0T4 NM:i:2
+a 0 xx 6 1 11M * 0 0 AAAAGRTTTTT ABCDEFGHIJK MD:Z:4A0T4Y0 NM:i:3
a 0 xx 6 1 10M * 0 0 GAAAATTTTG * MD:Z:0A8T0 NM:i:2
i 0 xx 6 1 5M1I5M * 0 0 AAAAAGTTTTT * MD:Z:10 NM:i:1
i 0 xx 6 1 5M3I5M * 0 0 AAAAAGGGTTTTT * MD:Z:10 NM:i:3
i 0 xx 6 1 10M2I * 0 0 AAAAATTTTTCC * MD:Z:10 NM:i:2
i 0 xx 6 1 10M2P2I * 0 0 AAAAATTTTTCC * MD:Z:10 NM:i:2
-d 0 xx 6 1 5M10D5M * 0 0 AAAAACCCCC * MD:Z:5^TTTTTTTTTT5 NM:i:10
+d 0 xx 6 1 5M10D5M * 0 0 AAAAACCCCC * MD:Z:5^TTTTTYTTTT5 NM:i:10
d 0 xx 6 1 5M10N5M * 0 0 AAAAACCCCC * MD:Z:10 NM:i:0
-sid 0 xx 6 1 1S4M10D5I4M1S * 0 0 AAAAAGGGGGCCCCC * MD:Z:4^ATTTTTTTTT0T3 NM:i:16
+sid 0 xx 6 1 1S4M10D5I4M1S * 0 0 AAAAAGGGGGCCCCC * MD:Z:4^ATTTTTYTTT0T3 NM:i:16
=====================================
tests/data/xx#MD2.sam
=====================================
@@ -1,7 +1,8 @@
@SQ SN:xx LN:30
@CO All MD and/or NM should differ to the stored values
a 0 xx 6 1 10M * 0 0 AAAAATTTTT * MD:Z:9 NM:i:0 co:Z:MD incorrect fields
-a 0 xx 6 1 10M * 0 0 AAAAGGTTTT * MD:Z:4A0A4 NM:i:2
+a 0 xx 6 1 11M * 0 0 AAAAGGTTTTT * MD:Z:4A0T4Y0 NM:i:2
+a 0 xx 6 1 11M * 0 0 AAAAGGTTTTT * MD:Z:4A0T4N0 NM:i:3
a 0 xx 6 1 10M * 0 0 GAAAATTTTG * MD:Z:0G8T0 NM:i:2
i 0 xx 6 1 5M1I5M * 0 0 AAAAAGTTTTT * MD:Z:11 NM:i:1
i 0 xx 6 1 5M3I5M * 0 0 AAAAAGGGTTTTT * MD:Z:1A1 NM:i:3
=====================================
tests/data/xx#rg.sam
=====================================
@@ -1,5 +1,5 @@
@HD VN:1.4 SO:coordinate
-@SQ SN:xx LN:30 AS:? SP:? UR:? M5:bbf4de6d8497a119dda6e074521643dc
+@SQ SN:xx LN:30 AS:? SP:? UR:? M5:1224b81d8664d77635e1620d7f2c1523
@RG ID:x1 SM:x1
@RG ID:x2 SM:x2 LB:x PG:foo:bar PI:1111
@PG ID:emacs PN:emacs VN:23.1.1
=====================================
tests/data/xx.fa
=====================================
@@ -1,5 +1,5 @@
>xx
-AAAAAAAAAATTTTTTTTTTCCCCCCCCCC
+AAAAAAAAAATTTTTYTTTTCCCCCCCCCC
>yy
AAAAAAAAAATTTTTTTTTT
View it on GitLab: https://salsa.debian.org/med-team/staden-io-lib/-/commit/7bb8453ebfd223060c4681dac0a9200c7fe078e3