[med-svn] [Git][med-team/python-dnaio][master] 5 commits: New upstream version 1.2.1

Étienne Mollier (@emollier) gitlab at salsa.debian.org
Sat Aug 31 10:43:47 BST 2024



Étienne Mollier pushed to branch master at Debian Med / python-dnaio


Commits:
649d84a4 by Étienne Mollier at 2024-08-31T11:37:53+02:00
New upstream version 1.2.1
- - - - -
ca4221c5 by Étienne Mollier at 2024-08-31T11:37:53+02:00
Update upstream source from tag 'upstream/1.2.1'

Update to upstream version '1.2.1'
with Debian dir 0dd159ac5634c679e1bf1d5b412e31fbd4d1d630
- - - - -
7d72c4ef by Étienne Mollier at 2024-08-31T11:38:50+02:00
d/control: add myself to uploaders.

- - - - -
77d2161e by Étienne Mollier at 2024-08-31T11:39:07+02:00
d/control: declare compliance to standards version 4.7.0.

- - - - -
4004a7ec by Étienne Mollier at 2024-08-31T11:42:06+02:00
Ready for upload to unstable.

- - - - -


11 changed files:

- .github/workflows/ci.yml
- CHANGES.rst
- + CITATION.cff
- README.rst
- debian/changelog
- debian/control
- pyproject.toml
- src/dnaio/ascii_check.h
- src/dnaio/bam.h
- src/dnaio/chunks.py
- tests/test_chunks.py


Changes:

=====================================
.github/workflows/ci.yml
=====================================
@@ -15,9 +15,9 @@ jobs:
         python-version: ["3.10"]
         toxenv: [flake8, black, mypy, docs]
     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v5
       with:
         python-version: ${{ matrix.python-version }}
     - name: Install tox
@@ -31,12 +31,12 @@ jobs:
       github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name
     runs-on: ubuntu-latest
     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
       with:
         fetch-depth: 0  # required for setuptools_scm
     - name: Build sdist and temporary wheel
       run: pipx run build
-    - uses: actions/upload-artifact@v3
+    - uses: actions/upload-artifact@v4
       with:
         name: sdist
         path: dist/*.tar.gz
@@ -51,29 +51,29 @@ jobs:
       matrix:
         os: [ubuntu-latest]
         python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
-        compile_flags: [""]
         include:
-        - os: macos-latest
+        - os: macos-13
+          python-version: "3.10"
+        - os: macos-14
           python-version: "3.10"
         - os: windows-latest
           python-version: "3.10"
         - os: ubuntu-latest
           python-version: "3.10"
-          compile_flags: "-mssse3"
     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v5
       with:
         python-version: ${{ matrix.python-version }}
     - name: Install tox
       run: python -m pip install tox
     - name: Test
       run: tox -e py
-      env:
-        CFLAGS: ${{ matrix.compile_flags }}
     - name: Upload coverage report
-      uses: codecov/codecov-action@v3
+      uses: codecov/codecov-action@v4
+      with:
+        token: ${{ secrets.CODECOV_TOKEN }}
 
   wheels:
     if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
@@ -81,20 +81,21 @@ jobs:
     timeout-minutes: 15
     strategy:
       matrix:
-        os: [ubuntu-latest, windows-latest, macos-latest]
+        os: [ubuntu-latest, windows-latest, macos-13, macos-14]
     runs-on: ${{ matrix.os }}
     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
       with:
         fetch-depth: 0  # required for setuptools_scm
     - name: Build wheels
-      uses: pypa/cibuildwheel@v2.16.2
+      uses: pypa/cibuildwheel@v2.17.0
       env:
-        CIBW_BUILD: "cp*-manylinux_x86_64 cp3*-win_amd64 cp3*-macosx_x86_64"
+        CIBW_BUILD: "cp*-manylinux_x86_64 cp3*-win_amd64 cp3*-macosx_x86_64 cp3*-macosx_arm64"
         CIBW_SKIP: "cp37-*"
-    - uses: actions/upload-artifact@v3
+        CIBW_TEST_SKIP: "cp38-macosx_*:arm64"
+    - uses: actions/upload-artifact@v4
       with:
-        name: wheels
+        name: wheels-${{ matrix.os }}
         path: wheelhouse/*.whl
 
   publish:
@@ -102,14 +103,15 @@ jobs:
     needs: [build, wheels]
     runs-on: ubuntu-latest
     steps:
-    - uses: actions/download-artifact@v3
+    - uses: actions/download-artifact@v4
       with:
         name: sdist
         path: dist/
-    - uses: actions/download-artifact@v3
+    - uses: actions/download-artifact@v4
       with:
-        name: wheels
+        pattern: wheels-*
         path: dist/
+        merge-multiple: true
     - name: Publish to PyPI
       uses: pypa/gh-action-pypi-publish@v1.5.1
       with:


=====================================
CHANGES.rst
=====================================
@@ -2,6 +2,11 @@
 Changelog
 =========
 
+v1.2.1 (2024-06-17)
+-------------------
+
+* Make macOS ARM64 wheels available.
+
 v1.2.0 (2023-12-11)
 -------------------
 


=====================================
CITATION.cff
=====================================
@@ -0,0 +1,16 @@
+cff-version: 1.2.0
+title: dnaio
+type: software
+authors:
+  - given-names: Marcel
+    family-names: Martin
+    orcid: 'https://orcid.org/0000-0002-0680-200X'
+  - given-names: Ruben Harmen Paul
+    family-names: Vorderman
+    orcid: 'https://orcid.org/0000-0002-8813-1528'
+identifiers:
+  - type: doi
+    value: 10.5281/zenodo.10548864
+repository-code: 'https://github.com/marcelm/dnaio/'
+url: 'https://dnaio.readthedocs.io/'
+license: MIT


=====================================
README.rst
=====================================
@@ -1,3 +1,6 @@
+.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.10548864.svg
+  :target: https://doi.org/10.5281/zenodo.10548864
+
 .. image:: https://github.com/marcelm/dnaio/workflows/CI/badge.svg
     :alt: GitHub Actions badge
 
@@ -9,11 +12,15 @@
     :target: https://codecov.io/gh/marcelm/dnaio
     :alt: Codecov badge
 
-=====================================
-dnaio processes FASTQ and FASTA files
-=====================================
+===========================================
+dnaio processes FASTQ, FASTA and uBAM files
+===========================================
 
 ``dnaio`` is a Python 3.8+ library for very efficient parsing and writing of FASTQ and also FASTA files.
+Since ``dnaio`` version 1.1.0, support for efficiently parsing uBAM files has been implemented.
+This allows reading ONT files from the `dorado <https://github.com/nanoporetech/dorado>`_
+basecaller directly.
+
 The code was previously part of the
 `Cutadapt <https://cutadapt.readthedocs.io/>`_ tool and has been improved significantly since it has been split out.
 
@@ -36,7 +43,7 @@ For more, see the `tutorial <https://dnaio.readthedocs.io/en/latest/tutorial.htm
 Installation
 ============
 
-Using pip:: 
+Using pip::
 
     pip install dnaio zstandard
 
@@ -58,7 +65,7 @@ Limitations
 ===========
 
 - Multi-line FASTQ files are not supported
-- FASTQ and BAM parsing is the focus of this library. The FASTA parser is not as optimized
+- FASTQ and uBAM parsing is the focus of this library. The FASTA parser is not as optimized
 
 Links
 =====


=====================================
debian/changelog
=====================================
@@ -1,3 +1,11 @@
+python-dnaio (1.2.1-1) unstable; urgency=medium
+
+  * New upstream version 1.2.1
+  * d/control: add myself to uploaders.
+  * d/control: declare compliance to standards version 4.7.0.
+
+ -- Étienne Mollier <emollier at debian.org>  Sat, 31 Aug 2024 11:39:26 +0200
+
 python-dnaio (1.2.0-2) unstable; urgency=medium
 
   * Team upload.


=====================================
debian/control
=====================================
@@ -1,6 +1,7 @@
 Source: python-dnaio
 Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Liubov Chuprikova <chuprikovalv at gmail.com>
+Uploaders: Liubov Chuprikova <chuprikovalv at gmail.com>,
+           Étienne Mollier <emollier at debian.org>
 Section: python
 Testsuite: autopkgtest-pkg-pybuild
 Priority: optional
@@ -14,7 +15,7 @@ Build-Depends: debhelper-compat (= 13),
                python3-pytest <!nocheck>,
                python3-xopen <!nocheck>,
                cython3
-Standards-Version: 4.6.2
+Standards-Version: 4.7.0
 Vcs-Browser: https://salsa.debian.org/med-team/python-dnaio
 Vcs-Git: https://salsa.debian.org/med-team/python-dnaio.git
 Homepage: https://github.com/marcelm/dnaio


=====================================
pyproject.toml
=====================================
@@ -52,7 +52,7 @@ CFLAGS = "-g0 -DNDEBUG"
 CFLAGS = "-g0 -DNDEBUG"
 
 [tool.cibuildwheel.linux.environment]
-CFLAGS = "-g0 -DNDEBUG -mssse3"
+CFLAGS = "-g0 -DNDEBUG"
 
 [tool.cibuildwheel]
 test-requires = "pytest"


=====================================
src/dnaio/ascii_check.h
=====================================
@@ -1,12 +1,18 @@
 #include <stddef.h>
-#include <stdint.h>
-#ifdef __SSE2__
-#include "emmintrin.h"
-#endif
 
 #define ASCII_MASK_8BYTE 0x8080808080808080ULL
 #define ASCII_MASK_1BYTE 0x80
 
+static inline int string_is_ascii_fallback(const char *string, size_t length)
+{
+    /* Combining all characters with OR allows for only one bit check at the end */
+    size_t all_chars = 0;
+    for (size_t i=0; i<length; i++) {
+        all_chars |= string[i];
+    }
+    return !(all_chars & ASCII_MASK_1BYTE);
+}
+
 /**
  * @brief Check if a string of given length only contains ASCII characters.
  *
@@ -16,33 +22,37 @@
  * @returns 1 if the string is ASCII-only, 0 otherwise.
  */
 static int
-string_is_ascii(const char * string, size_t length) {
-    // By performing bitwise OR on all characters in 8-byte chunks (16-byte
-    // with SSE2) we can
-    // determine ASCII status in a non-branching (except the loops) fashion.
-    uint64_t all_chars = 0;
-    const char *cursor = string;
-    const char *string_end_ptr = string + length;
-    const char *string_8b_end_ptr = string_end_ptr - sizeof(uint64_t);
-    int non_ascii_in_vec = 0;
-    #ifdef __SSE2__
-    const char *string_16b_end_ptr = string_end_ptr - sizeof(__m128i);
-    __m128i vec_all_chars = _mm_setzero_si128();
-    while (cursor < string_16b_end_ptr) {
-        __m128i loaded_chars = _mm_loadu_si128((__m128i *)cursor);
-        vec_all_chars = _mm_or_si128(loaded_chars, vec_all_chars);
-        cursor += sizeof(__m128i);
+string_is_ascii(const char *string, size_t length)
+{
+    if (length < sizeof(size_t)) {
+        return string_is_ascii_fallback(string, length);
     }
-    non_ascii_in_vec = _mm_movemask_epi8(vec_all_chars);
-    #endif
-
-    while (cursor < string_8b_end_ptr) {
-        all_chars |= *(uint64_t *)cursor;
-        cursor += sizeof(uint64_t);
+    size_t number_of_chunks = length / sizeof(size_t);
+    size_t *chunks = (size_t *)string;
+    size_t number_of_unrolls = number_of_chunks / 4;
+    size_t remaining_chunks = number_of_chunks - (number_of_unrolls * 4);
+    size_t *chunk_ptr = chunks;
+    size_t all_chars0 = 0;
+    size_t all_chars1 = 0;
+    size_t all_chars2 = 0;
+    size_t all_chars3 = 0;
+    for (size_t i=0; i < number_of_unrolls; i++) {
+        /* Performing indepedent OR calculations allows the compiler to use
+           vectors. It also allows out of order execution. */
+        all_chars0 |= chunk_ptr[0];
+        all_chars1 |= chunk_ptr[1];
+        all_chars2 |= chunk_ptr[2];
+        all_chars3 |= chunk_ptr[3];
+        chunk_ptr += 4;
     }
-    while (cursor < string_end_ptr) {
-        all_chars |= *cursor;
-        cursor += 1;
+    size_t all_chars = all_chars0 | all_chars1 | all_chars2 | all_chars3;
+    for (size_t i=0; i<remaining_chunks; i++) {
+        all_chars |= chunk_ptr[i];
     }
-    return !(non_ascii_in_vec + (all_chars & ASCII_MASK_8BYTE));
+    /* Load the last few bytes left in a single integer for fast operations.
+       There is some overlap here with the work done before, but for a simple
+       ascii check this does not matter. */
+    size_t last_chunk = *(size_t *)(string + length - sizeof(size_t));
+    all_chars |= last_chunk;
+    return !(all_chars & ASCII_MASK_8BYTE);
 }
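
The rewritten string_is_ascii above ORs whole words together and then reads one final, possibly overlapping word from the end of the buffer. The following is a minimal Python sketch of that idea, not the library's code: the names is_ascii, is_ascii_fallback and WORD are hypothetical, WORD = 8 is an assumed stand-in for sizeof(size_t), and the four-accumulator unrolling is left out because it only matters for C performance.

```python
WORD = 8  # assumed word size in bytes, standing in for sizeof(size_t)
ASCII_MASK_1BYTE = 0x80
ASCII_MASK_WORD = int.from_bytes(b"\x80" * WORD, "little")  # 0x8080808080808080


def is_ascii_fallback(data: bytes) -> bool:
    """OR all bytes together, then test the high bit a single time."""
    all_chars = 0
    for byte in data:
        all_chars |= byte
    return not (all_chars & ASCII_MASK_1BYTE)


def is_ascii(data: bytes) -> bool:
    """Word-at-a-time variant with an overlapping final word."""
    length = len(data)
    if length < WORD:
        return is_ascii_fallback(data)
    all_chars = 0
    # Whole words counted from the start of the buffer.
    for offset in range(0, length - length % WORD, WORD):
        all_chars |= int.from_bytes(data[offset:offset + WORD], "little")
    # The last WORD bytes may overlap bytes that were already combined.
    # OR-ing a byte in twice cannot change the result, so the overlap is safe.
    all_chars |= int.from_bytes(data[length - WORD:], "little")
    return not (all_chars & ASCII_MASK_WORD)
```

Reading the tail as one overlapping word removes the need for a byte-by-byte cleanup loop, which is why inputs shorter than one word are routed to the fallback first.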


=====================================
src/dnaio/bam.h
=====================================
@@ -1,21 +1,33 @@
-#include <stdint.h>
-#include <stddef.h>
-#include <string.h>
-#include <assert.h>
+// Macros also used in htslib, very useful.
+#if defined __GNUC__
+#define GCC_AT_LEAST(major, minor) \
+    (__GNUC__ > (major) || (__GNUC__ == (major) && __GNUC_MINOR__ >= (minor)))
+#else 
+# define GCC_AT_LEAST(major, minor) 0
+#endif
 
-#ifdef __SSE2__
-#include "emmintrin.h"
+#if defined(__clang__) && defined(__has_attribute)
+#define CLANG_COMPILER_HAS(attribute) __has_attribute(attribute)
+#else
+#define CLANG_COMPILER_HAS(attribute) 0
 #endif
 
-#ifdef __SSSE3__
-#include "tmmintrin.h"
+#define COMPILER_HAS_TARGET (GCC_AT_LEAST(4, 8) || CLANG_COMPILER_HAS(__target__))
+#define COMPILER_HAS_OPTIMIZE (GCC_AT_LEAST(4,4) || CLANG_COMPILER_HAS(optimize))
+
+#if defined(__x86_64__) || defined(_M_X64)
+#define BUILD_IS_X86_64 1
+#include "immintrin.h"
+#else
+#define BUILD_IS_X86_64 0
 #endif
 
-static void
-decode_bam_sequence(uint8_t *dest, const uint8_t *encoded_sequence, size_t length)
-{
-    /* Reuse a trick from sam_internal.h in htslib. Have a table to lookup
-       two characters simultaneously.*/
+#include <stdint.h>
+#include <string.h>
+#include <stddef.h>
+
+static void 
+decode_bam_sequence_default(uint8_t *dest, const uint8_t *encoded_sequence, size_t length)  {
     static const char code2base[512] =
         "===A=C=M=G=R=S=V=T=W=Y=H=K=D=B=N"
         "A=AAACAMAGARASAVATAWAYAHAKADABAN"
@@ -34,10 +46,26 @@ decode_bam_sequence(uint8_t *dest, const uint8_t *encoded_sequence, size_t lengt
         "B=BABCBMBGBRBSBVBTBWBYBHBKBDBBBN"
         "N=NANCNMNGNRNSNVNTNWNYNHNKNDNBNN";
     static const uint8_t *nuc_lookup = (uint8_t *)"=ACMGRSVTWYHKDBN";
+    size_t length_2 = length / 2; 
+    for (size_t i=0; i < length_2; i++) {
+        memcpy(dest + i*2, code2base + ((size_t)encoded_sequence[i] * 2), 2);
+    }
+    if (length & 1) {
+        uint8_t encoded = encoded_sequence[length_2] >> 4;
+        dest[(length - 1)] = nuc_lookup[encoded];
+    }
+}
+
+#if COMPILER_HAS_TARGET && BUILD_IS_X86_64
+__attribute__((__target__("ssse3")))
+static void 
+decode_bam_sequence_ssse3(uint8_t *dest, const uint8_t *encoded_sequence, size_t length) 
+{
+
+    static const uint8_t *nuc_lookup = (uint8_t *)"=ACMGRSVTWYHKDBN";
     const uint8_t *dest_end_ptr = dest + length;
     uint8_t *dest_cursor = dest;
     const uint8_t *encoded_cursor = encoded_sequence;
-    #ifdef __SSSE3__
     const uint8_t *dest_vec_end_ptr = dest_end_ptr - (2 * sizeof(__m128i));
     __m128i first_upper_shuffle = _mm_setr_epi8(
         0, 0xff, 1, 0xff, 2, 0xff, 3, 0xff, 4, 0xff, 5, 0xff, 6, 0xff, 7, 0xff);
@@ -84,44 +112,47 @@ decode_bam_sequence(uint8_t *dest, const uint8_t *encoded_sequence, size_t lengt
         encoded_cursor += sizeof(__m128i);
         dest_cursor += 2 * sizeof(__m128i);
     }
-    #endif
-    /* Do two at the time until it gets to the last even address. */
-    const uint8_t *dest_end_ptr_twoatatime = dest + (length & (~1ULL));
-    while (dest_cursor < dest_end_ptr_twoatatime) {
-        /* According to htslib, size_t cast helps the optimizer.
-           Code confirmed to indeed run faster. */
-        memcpy(dest_cursor, code2base + ((size_t)*encoded_cursor * 2), 2);
-        dest_cursor += 2;
-        encoded_cursor += 1;
+    decode_bam_sequence_default(dest_cursor, encoded_cursor, dest_end_ptr - dest_cursor);
+}
+
+static void (*decode_bam_sequence)(
+    uint8_t *dest, const uint8_t *encoded_sequence, size_t length);
+
+/* Simple dispatcher function, updates the function pointer after testing the
+   CPU capabilities. After this, the dispatcher function is not needed anymore. */
+static void decode_bam_sequence_dispatch(
+        uint8_t *dest, const uint8_t *encoded_sequence, size_t length) {
+    if (__builtin_cpu_supports("ssse3")) {
+        decode_bam_sequence = decode_bam_sequence_ssse3;
     }
-    assert((dest_end_ptr - dest_cursor) < 2);
-    if (dest_cursor != dest_end_ptr) {
-        /* There is a single encoded nuc left */
-        uint8_t encoded_nucs = *encoded_cursor;
-        uint8_t upper_nuc_index = encoded_nucs >> 4;
-        dest_cursor[0] = nuc_lookup[upper_nuc_index];
+    else {
+        decode_bam_sequence = decode_bam_sequence_default;
     }
+    decode_bam_sequence(dest, encoded_sequence, length);
+}
+
+static void (*decode_bam_sequence)(
+    uint8_t *dest, const uint8_t *encoded_sequence, size_t length
+) = decode_bam_sequence_dispatch;
+
+#else
+static inline void decode_bam_sequence(uint8_t *dest, const uint8_t *encoded_sequence, size_t length) 
+{
+    decode_bam_sequence_default(dest, encoded_sequence, length);
 }
+#endif 
 
-static void
-decode_bam_qualities(uint8_t *dest, const uint8_t *encoded_qualities, size_t length)
+// Code is simple enough to be auto vectorized.
+#if COMPILER_HAS_OPTIMIZE
+__attribute__((optimize("O3")))
+#endif
+static void 
+decode_bam_qualities(
+    uint8_t *restrict dest,
+    const uint8_t *restrict encoded_qualities,
+    size_t length)
 {
-    const uint8_t *end_ptr = encoded_qualities + length;
-    const uint8_t *cursor = encoded_qualities;
-    uint8_t *dest_cursor = dest;
-    #ifdef __SSE2__
-    const uint8_t *vec_end_ptr = end_ptr - sizeof(__m128i);
-    while (cursor < vec_end_ptr) {
-        __m128i quals = _mm_loadu_si128((__m128i *)cursor);
-        __m128i phreds = _mm_add_epi8(quals, _mm_set1_epi8(33));
-        _mm_storeu_si128((__m128i *)dest_cursor, phreds);
-        cursor += sizeof(__m128i);
-        dest_cursor += sizeof(__m128i);
-    }
-    #endif
-    while (cursor < end_ptr) {
-        *dest_cursor = *cursor + 33;
-        cursor += 1;
-        dest_cursor += 1;
+    for (size_t i=0; i<length; i++) {
+        dest[i] = encoded_qualities[i] + 33;
     }
-}
\ No newline at end of file
+}
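
For readers less familiar with the BAM layout, the logic of decode_bam_sequence_default and decode_bam_qualities above can be sketched in Python. This is an illustration only, under the assumption that the functions below mirror the C behaviour shown in the diff; the Python names and signatures are hypothetical and are not part of dnaio's API.

```python
NUC_LOOKUP = "=ACMGRSVTWYHKDBN"  # same 16-entry table as nuc_lookup in the diff


def decode_bam_sequence(encoded: bytes, length: int) -> str:
    """Decode `length` bases from 4-bit packed BAM sequence data."""
    bases = []
    for byte in encoded[: length // 2]:
        bases.append(NUC_LOOKUP[byte >> 4])    # high nibble holds the first base
        bases.append(NUC_LOOKUP[byte & 0x0F])  # low nibble holds the second base
    if length & 1:
        # Odd length: only the high nibble of the final byte is used.
        bases.append(NUC_LOOKUP[encoded[length // 2] >> 4])
    return "".join(bases)


def decode_bam_qualities(encoded: bytes) -> bytes:
    """Add 33 to each raw score; the mask mimics uint8_t wraparound in C."""
    return bytes((q + 33) & 0xFF for q in encoded)
```

For example, the bytes 0x12 0x48 0x80 with a length of 5 decode to ACGTT, and the raw scores 0 and 40 become the quality characters ! and I. The C version looks up both bases of a byte at once through the 512-byte code2base table; the sketch does two separate lookups for clarity.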


=====================================
src/dnaio/chunks.py
=====================================
@@ -189,7 +189,7 @@ def read_paired_chunks(
     start2 = f2.readinto(memoryview(buf2)[0:1])
 
     if start1 == 0 and start2 == 0:
-        return memoryview(b""), memoryview(b"")
+        return
 
     if (start1 == 0) != (start2 == 0):
         i = 2 if start1 == 0 else 1


=====================================
tests/test_chunks.py
=====================================
@@ -122,6 +122,15 @@ def test_fastq_head():
     assert _fastq_head(b"A\nB\nC\nD\nE\nF\nG\nH\nI\n") == 16
 
 
+def test_read_paired_chunks_empty_input(tmp_path):
+    empty_fastq = tmp_path / "empty.fastq"
+    empty_fastq.write_text("")
+    with open(empty_fastq, "rb") as f1:
+        with open(empty_fastq, "rb") as f2:
+            chunks = list(read_paired_chunks(f1, f2))
+    assert len(chunks) == 0
+
+
 def test_read_paired_chunks_fastq():
     with open("tests/data/paired.1.fastq", "rb") as f1:
         with open("tests/data/paired.2.fastq", "rb") as f2:
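
The chunks.py change and the new test above rely on how return behaves inside a generator. The sketch below assumes read_paired_chunks is a generator, which the test suggests by consuming it with list(). The two functions are hypothetical stand-ins and not dnaio's implementation.

```python
from typing import Iterator, Tuple


def paired_chunks(data1: bytes, data2: bytes) -> Iterator[Tuple[memoryview, memoryview]]:
    """Stand-in for the new behaviour: empty inputs produce no chunks."""
    if not data1 and not data2:
        # A bare return ends the generator; the consumer sees zero items.
        return
    yield memoryview(data1), memoryview(data2)


def legacy_paired_chunks(data1: bytes, data2: bytes):
    """Stand-in for the old behaviour, which returned a tuple of empty views."""
    if not data1 and not data2:
        # The value is attached to StopIteration; it is never yielded.
        return memoryview(b""), memoryview(b"")
    yield memoryview(data1), memoryview(data2)
```

Both variants give an empty list when consumed with list() on empty inputs, so under this assumption the bare return mainly states the intent more clearly than returning a value no caller iterating the generator would ever receive.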



View it on GitLab: https://salsa.debian.org/med-team/python-dnaio/-/compare/52f500fdfd91d8e7dad21a49584823cc0d0eb8ac...4004a7ec7e123f1a818834812b418dad98804473
