[med-svn] [Git][med-team/python-dnaio][master] 5 commits: New upstream version 1.2.1
Étienne Mollier (@emollier)
gitlab at salsa.debian.org
Sat Aug 31 10:43:47 BST 2024
Étienne Mollier pushed to branch master at Debian Med / python-dnaio
Commits:
649d84a4 by Étienne Mollier at 2024-08-31T11:37:53+02:00
New upstream version 1.2.1
- - - - -
ca4221c5 by Étienne Mollier at 2024-08-31T11:37:53+02:00
Update upstream source from tag 'upstream/1.2.1'
Update to upstream version '1.2.1'
with Debian dir 0dd159ac5634c679e1bf1d5b412e31fbd4d1d630
- - - - -
7d72c4ef by Étienne Mollier at 2024-08-31T11:38:50+02:00
d/control: add myself to uploaders.
- - - - -
77d2161e by Étienne Mollier at 2024-08-31T11:39:07+02:00
d/control: declare compliance to standards version 4.7.0.
- - - - -
4004a7ec by Étienne Mollier at 2024-08-31T11:42:06+02:00
Ready for upload to unstable.
- - - - -
11 changed files:
- .github/workflows/ci.yml
- CHANGES.rst
- + CITATION.cff
- README.rst
- debian/changelog
- debian/control
- pyproject.toml
- src/dnaio/ascii_check.h
- src/dnaio/bam.h
- src/dnaio/chunks.py
- tests/test_chunks.py
Changes:
=====================================
.github/workflows/ci.yml
=====================================
@@ -15,9 +15,9 @@ jobs:
python-version: ["3.10"]
toxenv: [flake8, black, mypy, docs]
steps:
- - uses: actions/checkout@v3
+ - uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
- uses: actions/setup-python@v4
+ uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install tox
@@ -31,12 +31,12 @@ jobs:
github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v3
+ - uses: actions/checkout@v4
with:
fetch-depth: 0 # required for setuptools_scm
- name: Build sdist and temporary wheel
run: pipx run build
- - uses: actions/upload-artifact@v3
+ - uses: actions/upload-artifact@v4
with:
name: sdist
path: dist/*.tar.gz
@@ -51,29 +51,29 @@ jobs:
matrix:
os: [ubuntu-latest]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
- compile_flags: [""]
include:
- - os: macos-latest
+ - os: macos-13
+ python-version: "3.10"
+ - os: macos-14
python-version: "3.10"
- os: windows-latest
python-version: "3.10"
- os: ubuntu-latest
python-version: "3.10"
- compile_flags: "-mssse3"
steps:
- - uses: actions/checkout@v3
+ - uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
- uses: actions/setup-python@v4
+ uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install tox
run: python -m pip install tox
- name: Test
run: tox -e py
- env:
- CFLAGS: ${{ matrix.compile_flags }}
- name: Upload coverage report
- uses: codecov/codecov-action@v3
+ uses: codecov/codecov-action@v4
+ with:
+ token: ${{ secrets.CODECOV_TOKEN }}
wheels:
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
@@ -81,20 +81,21 @@ jobs:
timeout-minutes: 15
strategy:
matrix:
- os: [ubuntu-latest, windows-latest, macos-latest]
+ os: [ubuntu-latest, windows-latest, macos-13, macos-14]
runs-on: ${{ matrix.os }}
steps:
- - uses: actions/checkout@v3
+ - uses: actions/checkout@v4
with:
fetch-depth: 0 # required for setuptools_scm
- name: Build wheels
- uses: pypa/cibuildwheel@v2.16.2
+ uses: pypa/cibuildwheel@v2.17.0
env:
- CIBW_BUILD: "cp*-manylinux_x86_64 cp3*-win_amd64 cp3*-macosx_x86_64"
+ CIBW_BUILD: "cp*-manylinux_x86_64 cp3*-win_amd64 cp3*-macosx_x86_64 cp3*-macosx_arm64"
CIBW_SKIP: "cp37-*"
- - uses: actions/upload-artifact@v3
+ CIBW_TEST_SKIP: "cp38-macosx_*:arm64"
+ - uses: actions/upload-artifact@v4
with:
- name: wheels
+ name: wheels-${{ matrix.os }}
path: wheelhouse/*.whl
publish:
@@ -102,14 +103,15 @@ jobs:
needs: [build, wheels]
runs-on: ubuntu-latest
steps:
- - uses: actions/download-artifact@v3
+ - uses: actions/download-artifact@v4
with:
name: sdist
path: dist/
- - uses: actions/download-artifact@v3
+ - uses: actions/download-artifact@v4
with:
- name: wheels
+ pattern: wheels-*
path: dist/
+ merge-multiple: true
- name: Publish to PyPI
uses: pypa/gh-action-pypi-publish@v1.5.1
with:
=====================================
CHANGES.rst
=====================================
@@ -2,6 +2,11 @@
Changelog
=========
+v1.2.1 (2024-06-17)
+-------------------
+
+* Make macOS ARM64 wheels available.
+
v1.2.0 (2023-12-11)
-------------------
=====================================
CITATION.cff
=====================================
@@ -0,0 +1,16 @@
+cff-version: 1.2.0
+title: dnaio
+type: software
+authors:
+ - given-names: Marcel
+ family-names: Martin
+ orcid: 'https://orcid.org/0000-0002-0680-200X'
+ - given-names: Ruben Harmen Paul
+ family-names: Vorderman
+ orcid: 'https://orcid.org/0000-0002-8813-1528'
+identifiers:
+ - type: doi
+ value: 10.5281/zenodo.10548864
+repository-code: 'https://github.com/marcelm/dnaio/'
+url: 'https://dnaio.readthedocs.io/'
+license: MIT
=====================================
README.rst
=====================================
@@ -1,3 +1,6 @@
+.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.10548864.svg
+ :target: https://doi.org/10.5281/zenodo.10548864
+
.. image:: https://github.com/marcelm/dnaio/workflows/CI/badge.svg
:alt: GitHub Actions badge
@@ -9,11 +12,15 @@
:target: https://codecov.io/gh/marcelm/dnaio
:alt: Codecov badge
-=====================================
-dnaio processes FASTQ and FASTA files
-=====================================
+===========================================
+dnaio processes FASTQ, FASTA and uBAM files
+===========================================
``dnaio`` is a Python 3.8+ library for very efficient parsing and writing of FASTQ and also FASTA files.
+Since ``dnaio`` version 1.1.0, support for efficiently parsing uBAM files has been implemented.
+This allows reading ONT files from the `dorado <https://github.com/nanoporetech/dorado>`_
+basecaller directly.
+
The code was previously part of the
`Cutadapt <https://cutadapt.readthedocs.io/>`_ tool and has been improved significantly since it has been split out.
@@ -36,7 +43,7 @@ For more, see the `tutorial <https://dnaio.readthedocs.io/en/latest/tutorial.htm
Installation
============
-Using pip::
+Using pip::
pip install dnaio zstandard
@@ -58,7 +65,7 @@ Limitations
===========
- Multi-line FASTQ files are not supported
-- FASTQ and BAM parsing is the focus of this library. The FASTA parser is not as optimized
+- FASTQ and uBAM parsing is the focus of this library. The FASTA parser is not as optimized
Links
=====
=====================================
debian/changelog
=====================================
@@ -1,3 +1,11 @@
+python-dnaio (1.2.1-1) unstable; urgency=medium
+
+ * New upstream version 1.2.1
+ * d/control: add myself to uploaders.
+ * d/control: declare compliance to standards version 4.7.0.
+
+ -- Étienne Mollier <emollier at debian.org> Sat, 31 Aug 2024 11:39:26 +0200
+
python-dnaio (1.2.0-2) unstable; urgency=medium
* Team upload.
=====================================
debian/control
=====================================
@@ -1,6 +1,7 @@
Source: python-dnaio
Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Liubov Chuprikova <chuprikovalv at gmail.com>
+Uploaders: Liubov Chuprikova <chuprikovalv at gmail.com>,
+ Étienne Mollier <emollier at debian.org>
Section: python
Testsuite: autopkgtest-pkg-pybuild
Priority: optional
@@ -14,7 +15,7 @@ Build-Depends: debhelper-compat (= 13),
python3-pytest <!nocheck>,
python3-xopen <!nocheck>,
cython3
-Standards-Version: 4.6.2
+Standards-Version: 4.7.0
Vcs-Browser: https://salsa.debian.org/med-team/python-dnaio
Vcs-Git: https://salsa.debian.org/med-team/python-dnaio.git
Homepage: https://github.com/marcelm/dnaio
=====================================
pyproject.toml
=====================================
@@ -52,7 +52,7 @@ CFLAGS = "-g0 -DNDEBUG"
CFLAGS = "-g0 -DNDEBUG"
[tool.cibuildwheel.linux.environment]
-CFLAGS = "-g0 -DNDEBUG -mssse3"
+CFLAGS = "-g0 -DNDEBUG"
[tool.cibuildwheel]
test-requires = "pytest"
=====================================
src/dnaio/ascii_check.h
=====================================
@@ -1,12 +1,18 @@
#include <stddef.h>
-#include <stdint.h>
-#ifdef __SSE2__
-#include "emmintrin.h"
-#endif
#define ASCII_MASK_8BYTE 0x8080808080808080ULL
#define ASCII_MASK_1BYTE 0x80
+static inline int string_is_ascii_fallback(const char *string, size_t length)
+{
+ /* Combining all characters with OR allows for only one bit check at the end */
+ size_t all_chars = 0;
+ for (size_t i=0; i<length; i++) {
+ all_chars |= string[i];
+ }
+ return !(all_chars & ASCII_MASK_1BYTE);
+}
+
/**
* @brief Check if a string of given length only contains ASCII characters.
*
@@ -16,33 +22,37 @@
* @returns 1 if the string is ASCII-only, 0 otherwise.
*/
static int
-string_is_ascii(const char * string, size_t length) {
- // By performing bitwise OR on all characters in 8-byte chunks (16-byte
- // with SSE2) we can
- // determine ASCII status in a non-branching (except the loops) fashion.
- uint64_t all_chars = 0;
- const char *cursor = string;
- const char *string_end_ptr = string + length;
- const char *string_8b_end_ptr = string_end_ptr - sizeof(uint64_t);
- int non_ascii_in_vec = 0;
- #ifdef __SSE2__
- const char *string_16b_end_ptr = string_end_ptr - sizeof(__m128i);
- __m128i vec_all_chars = _mm_setzero_si128();
- while (cursor < string_16b_end_ptr) {
- __m128i loaded_chars = _mm_loadu_si128((__m128i *)cursor);
- vec_all_chars = _mm_or_si128(loaded_chars, vec_all_chars);
- cursor += sizeof(__m128i);
+string_is_ascii(const char *string, size_t length)
+{
+ if (length < sizeof(size_t)) {
+ return string_is_ascii_fallback(string, length);
}
- non_ascii_in_vec = _mm_movemask_epi8(vec_all_chars);
- #endif
-
- while (cursor < string_8b_end_ptr) {
- all_chars |= *(uint64_t *)cursor;
- cursor += sizeof(uint64_t);
+ size_t number_of_chunks = length / sizeof(size_t);
+ size_t *chunks = (size_t *)string;
+ size_t number_of_unrolls = number_of_chunks / 4;
+ size_t remaining_chunks = number_of_chunks - (number_of_unrolls * 4);
+ size_t *chunk_ptr = chunks;
+ size_t all_chars0 = 0;
+ size_t all_chars1 = 0;
+ size_t all_chars2 = 0;
+ size_t all_chars3 = 0;
+ for (size_t i=0; i < number_of_unrolls; i++) {
+ /* Performing indepedent OR calculations allows the compiler to use
+ vectors. It also allows out of order execution. */
+ all_chars0 |= chunk_ptr[0];
+ all_chars1 |= chunk_ptr[1];
+ all_chars2 |= chunk_ptr[2];
+ all_chars3 |= chunk_ptr[3];
+ chunk_ptr += 4;
}
- while (cursor < string_end_ptr) {
- all_chars |= *cursor;
- cursor += 1;
+ size_t all_chars = all_chars0 | all_chars1 | all_chars2 | all_chars3;
+ for (size_t i=0; i<remaining_chunks; i++) {
+ all_chars |= chunk_ptr[i];
}
- return !(non_ascii_in_vec + (all_chars & ASCII_MASK_8BYTE));
+ /* Load the last few bytes left in a single integer for fast operations.
+ There is some overlap here with the work done before, but for a simple
+ ascii check this does not matter. */
+ size_t last_chunk = *(size_t *)(string + length - sizeof(size_t));
+ all_chars |= last_chunk;
+ return !(all_chars & ASCII_MASK_8BYTE);
}
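
The rewritten string_is_ascii above drops the SSE2 intrinsics in favour of OR-ing the input together in size_t-sized chunks, then finishing with one overlapping load of the last sizeof(size_t) bytes. As a rough illustration only, the same idea can be modelled in Python. The names WORD, is_ascii_fallback and is_ascii are made up for this sketch and are not part of dnaio; an 8-byte word is assumed, and the 4-way unrolling of the C loop is left out because it affects speed, not the result.

```python
WORD = 8  # assumed stand-in for sizeof(size_t) on a 64-bit build
ASCII_MASK_1BYTE = 0x80
# 0x80 repeated in every byte of a word, like ASCII_MASK_8BYTE in the header
ASCII_MASK_WORD = int.from_bytes(b"\x80" * WORD, "little")


def is_ascii_fallback(data: bytes) -> bool:
    """Byte-wise path: OR every byte together, test the high bit once."""
    all_chars = 0
    for byte in data:
        all_chars |= byte
    return not (all_chars & ASCII_MASK_1BYTE)


def is_ascii(data: bytes) -> bool:
    """Word-wise path with an overlapping final word."""
    if len(data) < WORD:
        return is_ascii_fallback(data)
    all_chars = 0
    number_of_chunks = len(data) // WORD
    for i in range(number_of_chunks):
        chunk = data[i * WORD:(i + 1) * WORD]
        all_chars |= int.from_bytes(chunk, "little")
    # The last word may overlap bytes that were already processed.
    # OR-ing a byte in twice does not change the result.
    all_chars |= int.from_bytes(data[-WORD:], "little")
    return not (all_chars & ASCII_MASK_WORD)
```

Rereading a few bytes in the final word is harmless because OR is idempotent, which is why no separate byte-by-byte tail loop is needed once the input is at least one word long.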
=====================================
src/dnaio/bam.h
=====================================
@@ -1,21 +1,33 @@
-#include <stdint.h>
-#include <stddef.h>
-#include <string.h>
-#include <assert.h>
+// Macros also used in htslib, very useful.
+#if defined __GNUC__
+#define GCC_AT_LEAST(major, minor) \
+ (__GNUC__ > (major) || (__GNUC__ == (major) && __GNUC_MINOR__ >= (minor)))
+#else
+# define GCC_AT_LEAST(major, minor) 0
+#endif
-#ifdef __SSE2__
-#include "emmintrin.h"
+#if defined(__clang__) && defined(__has_attribute)
+#define CLANG_COMPILER_HAS(attribute) __has_attribute(attribute)
+#else
+#define CLANG_COMPILER_HAS(attribute) 0
#endif
-#ifdef __SSSE3__
-#include "tmmintrin.h"
+#define COMPILER_HAS_TARGET (GCC_AT_LEAST(4, 8) || CLANG_COMPILER_HAS(__target__))
+#define COMPILER_HAS_OPTIMIZE (GCC_AT_LEAST(4,4) || CLANG_COMPILER_HAS(optimize))
+
+#if defined(__x86_64__) || defined(_M_X64)
+#define BUILD_IS_X86_64 1
+#include "immintrin.h"
+#else
+#define BUILD_IS_X86_64 0
#endif
-static void
-decode_bam_sequence(uint8_t *dest, const uint8_t *encoded_sequence, size_t length)
-{
- /* Reuse a trick from sam_internal.h in htslib. Have a table to lookup
- two characters simultaneously.*/
+#include <stdint.h>
+#include <string.h>
+#include <stddef.h>
+
+static void
+decode_bam_sequence_default(uint8_t *dest, const uint8_t *encoded_sequence, size_t length) {
static const char code2base[512] =
"===A=C=M=G=R=S=V=T=W=Y=H=K=D=B=N"
"A=AAACAMAGARASAVATAWAYAHAKADABAN"
@@ -34,10 +46,26 @@ decode_bam_sequence(uint8_t *dest, const uint8_t *encoded_sequence, size_t lengt
"B=BABCBMBGBRBSBVBTBWBYBHBKBDBBBN"
"N=NANCNMNGNRNSNVNTNWNYNHNKNDNBNN";
static const uint8_t *nuc_lookup = (uint8_t *)"=ACMGRSVTWYHKDBN";
+ size_t length_2 = length / 2;
+ for (size_t i=0; i < length_2; i++) {
+ memcpy(dest + i*2, code2base + ((size_t)encoded_sequence[i] * 2), 2);
+ }
+ if (length & 1) {
+ uint8_t encoded = encoded_sequence[length_2] >> 4;
+ dest[(length - 1)] = nuc_lookup[encoded];
+ }
+}
+
+#if COMPILER_HAS_TARGET && BUILD_IS_X86_64
+__attribute__((__target__("ssse3")))
+static void
+decode_bam_sequence_ssse3(uint8_t *dest, const uint8_t *encoded_sequence, size_t length)
+{
+
+ static const uint8_t *nuc_lookup = (uint8_t *)"=ACMGRSVTWYHKDBN";
const uint8_t *dest_end_ptr = dest + length;
uint8_t *dest_cursor = dest;
const uint8_t *encoded_cursor = encoded_sequence;
- #ifdef __SSSE3__
const uint8_t *dest_vec_end_ptr = dest_end_ptr - (2 * sizeof(__m128i));
__m128i first_upper_shuffle = _mm_setr_epi8(
0, 0xff, 1, 0xff, 2, 0xff, 3, 0xff, 4, 0xff, 5, 0xff, 6, 0xff, 7, 0xff);
@@ -84,44 +112,47 @@ decode_bam_sequence(uint8_t *dest, const uint8_t *encoded_sequence, size_t lengt
encoded_cursor += sizeof(__m128i);
dest_cursor += 2 * sizeof(__m128i);
}
- #endif
- /* Do two at the time until it gets to the last even address. */
- const uint8_t *dest_end_ptr_twoatatime = dest + (length & (~1ULL));
- while (dest_cursor < dest_end_ptr_twoatatime) {
- /* According to htslib, size_t cast helps the optimizer.
- Code confirmed to indeed run faster. */
- memcpy(dest_cursor, code2base + ((size_t)*encoded_cursor * 2), 2);
- dest_cursor += 2;
- encoded_cursor += 1;
+ decode_bam_sequence_default(dest_cursor, encoded_cursor, dest_end_ptr - dest_cursor);
+}
+
+static void (*decode_bam_sequence)(
+ uint8_t *dest, const uint8_t *encoded_sequence, size_t length);
+
+/* Simple dispatcher function, updates the function pointer after testing the
+ CPU capabilities. After this, the dispatcher function is not needed anymore. */
+static void decode_bam_sequence_dispatch(
+ uint8_t *dest, const uint8_t *encoded_sequence, size_t length) {
+ if (__builtin_cpu_supports("ssse3")) {
+ decode_bam_sequence = decode_bam_sequence_ssse3;
}
- assert((dest_end_ptr - dest_cursor) < 2);
- if (dest_cursor != dest_end_ptr) {
- /* There is a single encoded nuc left */
- uint8_t encoded_nucs = *encoded_cursor;
- uint8_t upper_nuc_index = encoded_nucs >> 4;
- dest_cursor[0] = nuc_lookup[upper_nuc_index];
+ else {
+ decode_bam_sequence = decode_bam_sequence_default;
}
+ decode_bam_sequence(dest, encoded_sequence, length);
+}
+
+static void (*decode_bam_sequence)(
+ uint8_t *dest, const uint8_t *encoded_sequence, size_t length
+) = decode_bam_sequence_dispatch;
+
+#else
+static inline void decode_bam_sequence(uint8_t *dest, const uint8_t *encoded_sequence, size_t length)
+{
+ decode_bam_sequence_default(dest, encoded_sequence, length);
}
+#endif
-static void
-decode_bam_qualities(uint8_t *dest, const uint8_t *encoded_qualities, size_t length)
+// Code is simple enough to be auto vectorized.
+#if COMPILER_HAS_OPTIMIZE
+__attribute__((optimize("O3")))
+#endif
+static void
+decode_bam_qualities(
+ uint8_t *restrict dest,
+ const uint8_t *restrict encoded_qualities,
+ size_t length)
{
- const uint8_t *end_ptr = encoded_qualities + length;
- const uint8_t *cursor = encoded_qualities;
- uint8_t *dest_cursor = dest;
- #ifdef __SSE2__
- const uint8_t *vec_end_ptr = end_ptr - sizeof(__m128i);
- while (cursor < vec_end_ptr) {
- __m128i quals = _mm_loadu_si128((__m128i *)cursor);
- __m128i phreds = _mm_add_epi8(quals, _mm_set1_epi8(33));
- _mm_storeu_si128((__m128i *)dest_cursor, phreds);
- cursor += sizeof(__m128i);
- dest_cursor += sizeof(__m128i);
- }
- #endif
- while (cursor < end_ptr) {
- *dest_cursor = *cursor + 33;
- cursor += 1;
- dest_cursor += 1;
+ for (size_t i=0; i<length; i++) {
+ dest[i] = encoded_qualities[i] + 33;
}
-}
\ No newline at end of file
+}
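
For readers less familiar with the BAM encoding, the scalar path (decode_bam_sequence_default) and the simplified decode_bam_qualities loop above can be modelled in a few lines of Python. This is a sketch under assumptions, not dnaio's API: the function names are reused purely for illustration, the 256-entry CODE2BASE table is rebuilt from the "=ACMGRSVTWYHKDBN" lookup string on the assumption that the C table is its pairwise expansion (the rows visible in the diff are consistent with that), and the SSSE3 path and runtime dispatch are not modelled.

```python
NUC_LOOKUP = "=ACMGRSVTWYHKDBN"  # 4-bit code -> base, as in bam.h

# One table entry per encoded byte: high nibble first, then low nibble.
CODE2BASE = [NUC_LOOKUP[b >> 4] + NUC_LOOKUP[b & 0x0F] for b in range(256)]


def decode_bam_sequence(encoded: bytes, length: int) -> str:
    """Decode `length` bases from 4-bit packed BAM sequence data."""
    length_2 = length // 2
    # Each full byte yields two bases via a single table lookup.
    bases = [CODE2BASE[b] for b in encoded[:length_2]]
    if length & 1:
        # Odd length: the final base sits in the high nibble of the last byte.
        bases.append(NUC_LOOKUP[encoded[length_2] >> 4])
    return "".join(bases)


def decode_bam_qualities(encoded: bytes) -> bytes:
    """Add the Phred+33 offset; the mask mimics uint8_t wrap-around in C."""
    return bytes((q + 33) & 0xFF for q in encoded)
```

With these definitions, `decode_bam_sequence(bytes([0x12, 0x48]), 3)` reads the nibbles 1, 2 and 4 and ignores the trailing low nibble of the second byte.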
=====================================
src/dnaio/chunks.py
=====================================
@@ -189,7 +189,7 @@ def read_paired_chunks(
start2 = f2.readinto(memoryview(buf2)[0:1])
if start1 == 0 and start2 == 0:
- return memoryview(b""), memoryview(b"")
+ return
if (start1 == 0) != (start2 == 0):
i = 2 if start1 == 0 else 1
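
A note on the one-line change above, assuming read_paired_chunks is a generator function (the list(...) call in the new test suggests so): inside a generator, `return <value>` does not yield that value to the caller. It only attaches it to the StopIteration that ends iteration. The sketch below uses made-up generators (old_style, new_style) and a helper (stop_value) to show the difference. It is an illustration, not dnaio's code, and it does not claim to state upstream's reason for the change.

```python
def old_style(has_data):
    """Mirrors the removed line: returns a pair of empty views on empty input."""
    if not has_data:
        return memoryview(b""), memoryview(b"")
    yield memoryview(b"@r1\n"), memoryview(b"@r2\n")


def new_style(has_data):
    """Mirrors the added line: a bare return on empty input."""
    if not has_data:
        return
    yield memoryview(b"@r1\n"), memoryview(b"@r2\n")


def stop_value(gen):
    """Advance a generator once and report what its StopIteration carried."""
    try:
        next(gen)
    except StopIteration as stop:
        return stop.value
    raise AssertionError("generator yielded instead of finishing")
```

Both variants produce an empty list when iterated on empty input. The only observable difference is that the old form carries an unused tuple on StopIteration, while the bare return carries None.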
=====================================
tests/test_chunks.py
=====================================
@@ -122,6 +122,15 @@ def test_fastq_head():
assert _fastq_head(b"A\nB\nC\nD\nE\nF\nG\nH\nI\n") == 16
+def test_read_paired_chunks_empty_input(tmp_path):
+ empty_fastq = tmp_path / "empty.fastq"
+ empty_fastq.write_text("")
+ with open(empty_fastq, "rb") as f1:
+ with open(empty_fastq, "rb") as f2:
+ chunks = list(read_paired_chunks(f1, f2))
+ assert len(chunks) == 0
+
+
def test_read_paired_chunks_fastq():
with open("tests/data/paired.1.fastq", "rb") as f1:
with open("tests/data/paired.2.fastq", "rb") as f2:
View it on GitLab: https://salsa.debian.org/med-team/python-dnaio/-/compare/52f500fdfd91d8e7dad21a49584823cc0d0eb8ac...4004a7ec7e123f1a818834812b418dad98804473