[med-svn] [Git][med-team/python-dnaio][upstream] New upstream version 1.0.1
Lance Lin (@linqigang)
gitlab at salsa.debian.org
Fri Oct 13 17:57:57 BST 2023
Lance Lin pushed to branch upstream at Debian Med / python-dnaio
Commits:
7d343205 by Lance Lin at 2023-10-13T19:36:03+07:00
New upstream version 1.0.1
- - - - -
21 changed files:
- + .github/workflows/ci.yml
- .readthedocs.yaml
- CHANGES.rst
- − NOTES.md
- pyproject.toml
- − setup.cfg
- setup.py
- src/dnaio/__init__.py
- src/dnaio/_core.pyi
- src/dnaio/_core.pyx
- src/dnaio/ascii_check.h
- − src/dnaio/ascii_check_sse2.h
- src/dnaio/chunks.py
- src/dnaio/interfaces.py
- src/dnaio/multipleend.py
- tests/test_chunks.py
- tests/test_internal.py
- tests/test_multiple.py
- tests/test_open.py
- tests/test_records.py
- tox.ini
Changes:
=====================================
.github/workflows/ci.yml
=====================================
@@ -0,0 +1,112 @@
+name: CI
+
+on: [push, pull_request]
+
+jobs:
+ lint:
+ # Run for PRs only if they come from a forked repo (avoids duplicate runs)
+ if: >-
+ github.event_name != 'pull_request' ||
+ github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name
+ timeout-minutes: 10
+ runs-on: ubuntu-latest
+ strategy:
+ matrix:
+ python-version: ["3.10"]
+ toxenv: [flake8, black, mypy, docs]
+ steps:
+ - uses: actions/checkout@v3
+ - name: Set up Python ${{ matrix.python-version }}
+ uses: actions/setup-python@v4
+ with:
+ python-version: ${{ matrix.python-version }}
+ - name: Install tox
+ run: python -m pip install tox
+ - name: Run tox ${{ matrix.toxenv }}
+ run: tox -e ${{ matrix.toxenv }}
+
+ build:
+ if: >-
+ github.event_name != 'pull_request' ||
+ github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v3
+ with:
+ fetch-depth: 0 # required for setuptools_scm
+ - name: Build sdist and temporary wheel
+ run: pipx run build
+ - uses: actions/upload-artifact@v3
+ with:
+ name: sdist
+ path: dist/*.tar.gz
+
+ test:
+ if: >-
+ github.event_name != 'pull_request' ||
+ github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name
+ timeout-minutes: 10
+ runs-on: ${{ matrix.os }}
+ strategy:
+ matrix:
+ os: [ubuntu-latest]
+ python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+ include:
+ - os: macos-latest
+ python-version: "3.10"
+ - os: windows-latest
+ python-version: "3.10"
+ steps:
+ - uses: actions/checkout@v3
+ - name: Set up Python ${{ matrix.python-version }}
+ uses: actions/setup-python@v4
+ with:
+ python-version: ${{ matrix.python-version }}
+ - name: Install tox
+ run: python -m pip install tox
+ - name: Test
+ run: tox -e py
+ - name: Upload coverage report
+ uses: codecov/codecov-action@v3
+
+ wheels:
+ if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
+ needs: [lint, test]
+ timeout-minutes: 15
+ strategy:
+ matrix:
+ os: [ubuntu-latest, windows-latest, macos-latest]
+ runs-on: ${{ matrix.os }}
+ steps:
+ - uses: actions/checkout@v3
+ with:
+ fetch-depth: 0 # required for setuptools_scm
+ - name: Build wheels
+ uses: pypa/cibuildwheel@v2.16.2
+ env:
+ CIBW_BUILD: "cp*-manylinux_x86_64 cp3*-win_amd64 cp3*-macosx_x86_64"
+ CIBW_SKIP: "cp37-*"
+ - uses: actions/upload-artifact@v3
+ with:
+ name: wheels
+ path: wheelhouse/*.whl
+
+ publish:
+ if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
+ needs: [build, wheels]
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/download-artifact@v3
+ with:
+ name: sdist
+ path: dist/
+ - uses: actions/download-artifact@v3
+ with:
+ name: wheels
+ path: dist/
+ - name: Publish to PyPI
+ uses: pypa/gh-action-pypi-publish@v1.5.1
+ with:
+ password: ${{ secrets.pypi_password }}
+ #password: ${{ secrets.test_pypi_password }}
+ #repository_url: https://test.pypi.org/legacy/
=====================================
.readthedocs.yaml
=====================================
@@ -1,5 +1,8 @@
version: 2
-
+build:
+ os: "ubuntu-22.04"
+ tools:
+ python: "3.11"
python:
install:
- requirements: doc/requirements.txt
=====================================
CHANGES.rst
=====================================
@@ -2,6 +2,17 @@
Changelog
=========
+v1.0.1 (2023-10-06)
+-------------------
+
+* :pr:`120`: Improved type annotations.
+* Dropped support for Python 3.7
+* Added support for Python 3.12
+
+v1.0.0 (2023-09-06)
+-------------------
+* :pr:`110`: Added ``id`` and ``comment`` properties to ``SequenceRecord``.
+
v0.10.0 (2022-12-05)
--------------------
=====================================
NOTES.md deleted
=====================================
@@ -1,89 +0,0 @@
-- compressed
-- paired-end
-- interleaved
-- chunked
-- FASTA, FASTQ
-- BAM?
-
-
-import dnaio
-
-with dnaio.open('input.fastq.gz') as f:
- for record in f:
- print(record....)
-
-with dnaio.open('input.1.fastq.gz', 'input.2.fastq.gz') as f:
- for record in f:
- print(record....)
-
-
-Use cases
-
-- open FASTQ from path
-- open FASTA from path
-- open compressed FASTA or FASTQ (.gz, .bz2, .xz)
-- open paired-end data
-- open interleaved data
-- open file-like object (such as sys.stdin)
-- use custom sequence record class
-- autodetect file format from contents
-- write FASTQ
-- write FASTA
-- read FASTQ/FASTA chunks (multiple records)
-
-Issues
-
-- Binary vs text
-- Should SequenceRecord be immutable?
-
-
-TODO
-
-- Sequence.name should be Sequence.description or so (reserve .name for the part
- before the first space)
-- optimize writing
-- Documentation
-
-- Line endings
-- second header
-
-FASTQ chunks
-
-- need an index attribute
-- need a line_number attribute
-
-
-# API
-
-## Advertised
-
-- dnaio.open
-- Sequence(Record)
-- possibly SequencePair/PairedSequence?
-
-
-## Reader
-
-- FastqReader
-- FastaReader
-- PairedSequenceReader -> rename to PairedFastqReader?
-- InterleavedSequenceReader -> rename to InterleavedFastqReader
-
-
-## Writing
-
-class FastqWriter
-class FastaWriter
-class PairedSequenceWriter
-class InterleavedSequenceWriter
-
-
-## Chunking
-
-def find_fasta_record_end(buf, end):
-def find_fastq_record_end(buf, end=None):
-def read_chunks_from_file(f, buffer_size=4*1024**2):
-def read_paired_chunks(f, f2, buffer_size=4*1024**2):
-head
-fastq_head
-two_fastq_heads
=====================================
pyproject.toml
=====================================
@@ -2,6 +2,43 @@
requires = ["setuptools >= 52", "setuptools_scm >= 6.2", "Cython >= 0.29.20"]
build-backend = "setuptools.build_meta"
+[project]
+name = "dnaio"
+authors = [
+ {name = "Marcel Martin", email = "marcel.martin at scilifelab.se"},
+ {name = "Ruben Vorderman", email = "r.h.p.vorderman at lumc.nl"}
+]
+description = "Read and write FASTA and FASTQ files efficiently"
+readme = "README.rst"
+license = {text = "MIT"}
+classifiers = [
+ "Development Status :: 5 - Production/Stable",
+ "Intended Audience :: Science/Research",
+ "License :: OSI Approved :: MIT License",
+ "Programming Language :: Cython",
+ "Programming Language :: Python :: 3",
+ "Topic :: Scientific/Engineering :: Bio-Informatics"
+]
+requires-python = ">3.7"
+dependencies = [
+ "xopen >= 1.4.0"
+]
+dynamic = ["version"]
+
+[project.optional-dependencies]
+dev = [
+ "Cython",
+ "pytest"
+]
+
+[project.urls]
+"Homepage" = "https://dnaio.readthedocs.io/"
+"Changelog" = "https://dnaio.readthedocs.io/en/latest/changes.html"
+"Repository" = "https://github.com/marcelm/dnaio/"
+
+[tool.setuptools.exclude-package-data]
+dnaio = ["*.pyx"]
+
[tool.setuptools_scm]
write_to = "src/dnaio/_version.py"
@@ -16,3 +53,30 @@ test-command = ["cd {project}", "pytest tests"]
[[tool.cibuildwheel.overrides]]
select = "*-win*"
test-command = ["cd /d {project}", "pytest tests"]
+
+[tool.mypy]
+warn_unused_configs = true
+warn_redundant_casts = true
+warn_unused_ignores = true
+
+[tool.coverage.report]
+precision = 1
+exclude_also = [
+ "def __repr__",
+ "@overload",
+ "if TYPE_CHECKING:",
+]
+
+[tool.coverage.run]
+branch = true
+parallel = true
+include = [
+ "*/site-packages/dnaio/*",
+ "tests/*",
+]
+
+[tool.coverage.paths]
+source = [
+ "src/",
+ "*/site-packages/",
+]
=====================================
setup.cfg deleted
=====================================
@@ -1,37 +0,0 @@
-[metadata]
-name = dnaio
-author = Marcel Martin
-author_email = marcel.martin at scilifelab.se
-url = https://dnaio.readthedocs.io/
-description = Read and write FASTA and FASTQ files efficiently
-long_description = file: README.rst
-long_description_content_type = text/x-rst
-license = MIT
-project_urls =
- Changelog = https://dnaio.readthedocs.io/en/latest/changes.html
-classifiers =
- Development Status :: 5 - Production/Stable
- Intended Audience :: Science/Research
- License :: OSI Approved :: MIT License
- Programming Language :: Cython
- Programming Language :: Python :: 3
- Topic :: Scientific/Engineering :: Bio-Informatics
-
-[options]
-python_requires = >=3.7
-package_dir =
- =src
-packages = find:
-install_requires =
- xopen >= 1.4.0
-
-[options.packages.find]
-where = src
-
-[options.package_data]
-* = py.typed, *.pyi, *.h
-
-[options.extras_require]
-dev =
- Cython
- pytest
=====================================
setup.py
=====================================
@@ -1,11 +1,12 @@
import platform
-import sys
from setuptools import setup, Extension
import setuptools_scm # noqa Ensure it’s installed
-if platform.machine() == "x86_64" or platform.machine() == "AMD64":
- DEFINE_MACROS = [("USE_SSE2", None)]
+if platform.machine() == "AMD64":
+ # Macro is defined by default for clang and GCC on relevant targets, but
+ # not by MSVC.
+ DEFINE_MACROS = [("__SSE2__", 1)]
else:
DEFINE_MACROS = []
=====================================
src/dnaio/__init__.py
=====================================
@@ -32,7 +32,7 @@ __all__ = [
import functools
from os import PathLike
-from typing import Optional, Union, BinaryIO
+from typing import Optional, Union, BinaryIO, Literal, overload
from xopen import xopen
@@ -77,11 +77,138 @@ from ._version import version as __version__
# Backwards compatibility alias
Sequence = SequenceRecord
+_FileOrPath = Union[str, PathLike, BinaryIO]
+
+@overload
+def open(
+ _file: _FileOrPath,
+ *,
+ fileformat: Optional[str] = ...,
+ interleaved: Literal[False] = ...,
+ mode: Literal["r"] = ...,
+ qualities: Optional[bool] = ...,
+ opener=...,
+ compression_level: int = ...,
+ open_threads: int = ...,
+) -> SingleEndReader:
+ ...
+
+
+@overload
+def open(
+ _file1: _FileOrPath,
+ _file2: _FileOrPath,
+ *,
+ fileformat: Optional[str] = ...,
+ interleaved: Literal[False] = ...,
+ mode: Literal["r"] = ...,
+ qualities: Optional[bool] = ...,
+ opener=...,
+ compression_level: int = ...,
+ open_threads: int = ...,
+) -> PairedEndReader:
+ ...
+
+
+@overload
+def open(
+ _file: _FileOrPath,
+ *,
+ interleaved: Literal[True],
+ fileformat: Optional[str] = ...,
+ mode: Literal["r"] = ...,
+ qualities: Optional[bool] = ...,
+ opener=...,
+ compression_level: int = ...,
+ open_threads: int = ...,
+) -> PairedEndReader:
+ ...
+
+
+@overload
+def open(
+ _file1: _FileOrPath,
+ _file2: _FileOrPath,
+ _file3: _FileOrPath,
+ *files: _FileOrPath,
+ fileformat: Optional[str] = ...,
+ mode: Literal["r"] = ...,
+ qualities: Optional[bool] = ...,
+ opener=...,
+ compression_level: int = ...,
+ open_threads: int = ...,
+) -> MultipleFileReader:
+ ...
+
+
+@overload
+def open(
+ _file: _FileOrPath,
+ *,
+ mode: Literal["w", "a"],
+ fileformat: Optional[str] = ...,
+ interleaved: Literal[False] = ...,
+ qualities: Optional[bool] = ...,
+ opener=...,
+ compression_level: int = ...,
+ open_threads: int = ...,
+) -> SingleEndWriter:
+ ...
+
+
+@overload
def open(
- *files: Union[str, PathLike, BinaryIO],
- file1: Optional[Union[str, PathLike, BinaryIO]] = None,
- file2: Optional[Union[str, PathLike, BinaryIO]] = None,
+ _file1: _FileOrPath,
+ _file2: _FileOrPath,
+ *,
+ mode: Literal["w", "a"],
+ fileformat: Optional[str] = ...,
+ interleaved: Literal[False] = ...,
+ qualities: Optional[bool] = ...,
+ opener=...,
+ compression_level: int = ...,
+ open_threads: int = ...,
+) -> PairedEndWriter:
+ ...
+
+
+@overload
+def open(
+ _file: _FileOrPath,
+ *,
+ mode: Literal["w", "a"],
+ interleaved: Literal[True],
+ fileformat: Optional[str] = ...,
+ qualities: Optional[bool] = ...,
+ opener=...,
+ compression_level: int = ...,
+ open_threads: int = ...,
+) -> PairedEndWriter:
+ ...
+
+
+@overload
+def open(
+ _file1: _FileOrPath,
+ _file2: _FileOrPath,
+ _file3: _FileOrPath,
+ *files: _FileOrPath,
+ mode: Literal["w", "a"],
+ fileformat: Optional[str] = ...,
+ interleaved: Literal[False] = ...,
+ qualities: Optional[bool] = ...,
+ opener=...,
+ compression_level: int = ...,
+ open_threads: int = ...,
+) -> MultipleFileWriter:
+ ...
+
+
+def open(
+ *files: _FileOrPath,
+ file1: Optional[_FileOrPath] = None,
+ file2: Optional[_FileOrPath] = None,
fileformat: Optional[str] = None,
interleaved: bool = False,
mode: str = "r",
@@ -89,6 +216,7 @@ def open(
opener=xopen,
compression_level: int = 1,
open_threads: int = 0,
+ **_kwargs, # TODO Can we get rid of this? Only here to satisfy type checker
) -> Union[
SingleEndReader,
PairedEndReader,
@@ -150,9 +278,14 @@ def open(
this parameter. This parameter does not work when a custom opener is
set.
"""
- if files and (file1 is not None):
+ if files and (file1 is not None) and (file2 is not None):
+ raise ValueError(
+ "file1 and file2 arguments cannot be used together with files specified "
+ "as positional arguments"
+ )
+ elif files and (file1 is not None):
raise ValueError(
- "The file1 keyword argument cannot be used together with files specified"
+ "The file1 keyword argument cannot be used together with files specified "
"as positional arguments"
)
elif len(files) > 1 and file2 is not None:
@@ -160,16 +293,16 @@ def open(
"The file2 argument cannot be used together with more than one "
"file specified as positional argument"
)
- elif file1 is not None and file2 is not None and files:
- raise ValueError(
- "file1 and file2 arguments cannot be used together with files specified"
- "as positional arguments"
- )
elif file1 is not None and file2 is not None:
files = (file1, file2)
- elif file2 is not None and len(files) == 1:
+ elif file1 is not None:
+ files = (file1,)
+ elif len(files) == 1 and file2 is not None:
files = (files[0], file2)
+ del file1
+ del file2
+
if len(files) > 1 and interleaved:
raise ValueError("When interleaved is True, only one file must be specified.")
elif mode not in ("r", "w", "a"):
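
Note on the overloads above: they let a type checker pick the return type of open() from the literal value of the mode argument. A minimal sketch of the same pattern follows. The names open_records, Reader and Writer are hypothetical stand-ins, not part of dnaio, and the sketch covers only the single-file case.

```python
from typing import Literal, Union, overload


class Reader:
    """Hypothetical stand-in for a reader class."""


class Writer:
    """Hypothetical stand-in for a writer class."""


@overload
def open_records(path: str, *, mode: Literal["r"] = ...) -> Reader: ...
@overload
def open_records(path: str, *, mode: Literal["w", "a"]) -> Writer: ...


def open_records(path: str, *, mode: str = "r") -> Union[Reader, Writer]:
    # Only this implementation runs; the overloads above exist purely
    # for the benefit of the type checker.
    if mode == "r":
        return Reader()
    if mode in ("w", "a"):
        return Writer()
    raise ValueError(f"Unsupported mode: {mode!r}")
```

With these declarations, a checker such as mypy should infer Reader for open_records("in.fastq") and Writer for open_records("out.fastq", mode="w"), so callers need no cast or isinstance check.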
=====================================
src/dnaio/_core.pyi
=====================================
@@ -24,6 +24,10 @@ class SequenceRecord:
def fastq_bytes(self, two_headers: bool = ...) -> bytes: ...
def is_mate(self, other: SequenceRecord) -> bool: ...
def reverse_complement(self) -> SequenceRecord: ...
+ @property
+ def id(self) -> str: ...
+ @property
+ def comment(self) -> Optional[str]: ...
# Bytestring = Union[bytes, bytearray, memoryview]. Technically just 'bytes' is
# acceptable as an alias, but even more technically this function supports all
@@ -35,7 +39,7 @@ def paired_fastq_heads(
def records_are_mates(
__first_record: SequenceRecord,
__second_record: SequenceRecord,
- *__other_records: SequenceRecord
+ *__other_records: SequenceRecord,
) -> bool: ...
T = TypeVar("T")
@@ -51,3 +55,6 @@ class FastqIter(Generic[T]):
# Deprecated
def record_names_match(header1: str, header2: str) -> bool: ...
+
+# Private
+def bytes_ascii_check(b: bytes, length: int = -1) -> bool: ...
=====================================
src/dnaio/_core.pyx
=====================================
@@ -7,7 +7,7 @@ from cpython.unicode cimport PyUnicode_CheckExact, PyUnicode_GET_LENGTH, PyUnico
from cpython.object cimport Py_TYPE, PyTypeObject
from cpython.ref cimport PyObject
from cpython.tuple cimport PyTuple_GET_ITEM
-from libc.string cimport memcmp, memcpy, memchr, strcspn, memmove
+from libc.string cimport memcmp, memcpy, memchr, strcspn, strspn, memmove
cimport cython
cdef extern from "Python.h":
@@ -15,14 +15,7 @@ cdef extern from "Python.h":
bint PyUnicode_IS_COMPACT_ASCII(object o)
object PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
-cdef extern from *:
- """
- #if defined(USE_SSE2)
- #include "ascii_check_sse2.h"
- #else
- #include "ascii_check.h"
- #endif
- """
+cdef extern from "ascii_check.h":
int string_is_ascii(char *string, size_t length)
cdef extern from "_conversions.h":
@@ -84,6 +77,8 @@ cdef class SequenceRecord:
object _name
object _sequence
object _qualities
+ object _id
+ object _comment
def __init__(self, object name, object sequence, object qualities = None):
if not PyUnicode_CheckExact(name):
@@ -119,6 +114,8 @@ cdef class SequenceRecord:
if not PyUnicode_IS_COMPACT_ASCII(name):
raise ValueError(is_not_ascii_message("name", name))
self._name = name
+ self._id = None
+ self._comment = None
@property
def sequence(self):
@@ -150,6 +147,59 @@ cdef class SequenceRecord:
)
self._qualities = qualities
+ @property
+ def id(self):
+ """
+ The header part before any whitespace. This is the unique identifier
+ for the sequence.
+ """
+ cdef char *name
+ cdef size_t name_length
+ cdef size_t id_length
+ # Not yet cached is None
+ if self._id is None:
+ name = <char *>PyUnicode_DATA(self._name)
+ name_length = <size_t>PyUnicode_GET_LENGTH(self._name)
+ id_length = strcspn(name, "\t ")
+ if id_length == name_length:
+ self._id = self._name
+ else:
+ self._id = PyUnicode_New(id_length, 127)
+ memcpy(PyUnicode_DATA(self._id), name, id_length)
+ return self._id
+
+ @property
+ def comment(self):
+ """
+ The header part after the first whitespace. This is usually used
+ to store metadata. It may be empty in which case the attribute is None.
+ """
+ cdef char *name
+ cdef size_t name_length
+ cdef size_t id_length
+ cdef char *comment_start
+ cdef size_t comment_length
+ # Not yet cached is None
+ if self._comment is None:
+ name = <char *>PyUnicode_DATA(self._name)
+ name_length = <size_t>PyUnicode_GET_LENGTH(self._name)
+ id_length = strcspn(name, "\t ")
+ if id_length == name_length:
+ self._comment = ""
+ else:
+ comment_start = name + id_length + 1
+ # Skip empty whitespace before comment
+ comment_start = comment_start + strspn(comment_start, '\t ')
+ comment_length = name_length - (comment_start - name)
+ self._comment = PyUnicode_New(comment_length , 127)
+ memcpy(PyUnicode_DATA(self._comment), comment_start, comment_length)
+ # Empty comment is returned as None. This is not stored internally as
+ # None, otherwise the above code would run every time the attribute
+ # was accessed.
+ if PyUnicode_GET_LENGTH(self._comment) == 0:
+ return None
+ return self._comment
+
def __getitem__(self, key):
"""
Slice this SequenceRecord. If the qualities attribute is not None, it is
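
Note on the id and comment properties above: the header is split at the first space or tab, any further whitespace before the comment is skipped, and an empty comment is reported as None. A pure-Python sketch of that splitting rule, under the assumption that it mirrors the Cython logic, is shown here. The function name split_header is hypothetical, and the caching done by the Cython code is left out.

```python
from typing import Optional, Tuple


def split_header(name: str) -> Tuple[str, Optional[str]]:
    """Split a record header into (id, comment) at the first space or tab."""
    # Length of the leading run without space or tab, like strcspn(name, "\t ")
    id_length = next(
        (i for i, ch in enumerate(name) if ch in "\t "), len(name)
    )
    if id_length == len(name):
        # No whitespace at all: the whole header is the ID
        return name, None
    # Drop the separator and any further whitespace, like the strspn step
    comment = name[id_length:].lstrip("\t ")
    # An empty comment is reported as None
    return name[:id_length], comment or None
```

For example, split_header("r1 extra info") gives ("r1", "extra info"), while split_header("r1") gives ("r1", None).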
=====================================
src/dnaio/ascii_check.h
=====================================
@@ -1,34 +1,48 @@
-#define ASCII_MASK_8BYTE 0x8080808080808080ULL
-#define ASCII_MASK_1BYTE 0x80
-
#include <stddef.h>
#include <stdint.h>
+#ifdef __SSE2__
+#include "emmintrin.h"
+#endif
+#define ASCII_MASK_8BYTE 0x8080808080808080ULL
+#define ASCII_MASK_1BYTE 0x80
+
+/**
+ * @brief Check if a string of given length only contains ASCII characters.
+ *
+ * @param string A char pointer to the start of the string.
+ * @param length The length of the string. This funtion does not check for
+ * terminating NULL bytes.
+ * @returns 1 if the string is ASCII-only, 0 otherwise.
+ */
static int
-string_is_ascii(char * string, size_t length) {
- size_t n = length;
+string_is_ascii(const char * string, size_t length) {
+ // By performing bitwise OR on all characters in 8-byte chunks (16-byte
+ // with SSE2) we can
+ // determine ASCII status in a non-branching (except the loops) fashion.
uint64_t all_chars = 0;
- char * char_ptr = string;
- // The first loop aligns the memory address. Char_ptr is cast to a size_t
- // to return the memory address. Uint64_t is 8 bytes long, and the processor
- // handles this better when its address is a multiplier of 8. This loops
- // handles the first few bytes that are not on such a multiplier boundary.
- while ((size_t)char_ptr % sizeof(uint64_t) && n != 0) {
- all_chars |= *char_ptr;
- char_ptr += 1;
- n -= 1;
+ const char *cursor = string;
+ const char *string_end_ptr = string + length;
+ const char *string_8b_end_ptr = string_end_ptr - sizeof(uint64_t);
+ int non_ascii_in_vec = 0;
+ #ifdef __SSE2__
+ const char *string_16b_end_ptr = string_end_ptr - sizeof(__m128i);
+ __m128i vec_all_chars = _mm_setzero_si128();
+ while (cursor < string_16b_end_ptr) {
+ __m128i loaded_chars = _mm_loadu_si128((__m128i *)cursor);
+ vec_all_chars = _mm_or_si128(loaded_chars, vec_all_chars);
+ cursor += sizeof(__m128i);
}
- uint64_t *longword_ptr = (uint64_t *)char_ptr;
- while (n >= sizeof(uint64_t)) {
- all_chars |= *longword_ptr;
- longword_ptr += 1;
- n -= sizeof(uint64_t);
+ non_ascii_in_vec = _mm_movemask_epi8(vec_all_chars);
+ #endif
+
+ while (cursor < string_8b_end_ptr) {
+ all_chars |= *(uint64_t *)cursor;
+ cursor += sizeof(uint64_t);
}
- char_ptr = (char *)longword_ptr;
- while (n != 0) {
- all_chars |= *char_ptr;
- char_ptr += 1;
- n -= 1;
+ while (cursor < string_end_ptr) {
+ all_chars |= *cursor;
+ cursor += 1;
}
- return !(all_chars & ASCII_MASK_8BYTE);
+ return !(non_ascii_in_vec + (all_chars & ASCII_MASK_8BYTE));
}
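
Note on string_is_ascii above: every byte is ORed into an accumulator and the high bit of each byte is tested once at the end, so the loops contain no per-byte branch. A Python sketch of the 8-byte variant of this idea follows. It is an illustration only: it does not reproduce the SSE2 path or the exact loop bounds of the C code, and the name is_ascii is hypothetical.

```python
ASCII_MASK_8BYTE = 0x8080808080808080


def is_ascii(data: bytes) -> bool:
    """Return True if no byte in data has its high bit set."""
    all_chars = 0
    # Number of bytes covered by complete 8-byte words
    full = len(data) - len(data) % 8
    for offset in range(0, full, 8):
        # OR a whole 8-byte word into the accumulator
        all_chars |= int.from_bytes(data[offset:offset + 8], "little")
    for byte in data[full:]:
        # Remaining bytes land in the lowest byte of the accumulator
        all_chars |= byte
    # A single test at the end checks the high bit of every byte seen
    return (all_chars & ASCII_MASK_8BYTE) == 0
```

Because OR never clears a bit, one non-ASCII byte anywhere in the input leaves its high bit set in the accumulator, which the final mask detects.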
=====================================
src/dnaio/ascii_check_sse2.h deleted
=====================================
@@ -1,67 +0,0 @@
-// Copyright (c) 2022 Leiden University Medical Center
-
-// Permission is hereby granted, free of charge, to any person obtaining a copy
-// of this software and associated documentation files (the "Software"), to
-// deal in the Software without restriction, including without limitation the
-// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-// sell copies of the Software, and to permit persons to whom the Software is
-// furnished to do so, subject to the following conditions:
-
-// The above copyright notice and this permission notice shall be included in
-// all copies or substantial portions of the Software.
-
-// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
-// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
-// IN THE SOFTWARE.
-
-// This file is maintained and tested at
-// https://github.com/rhpvorderman/ascii-check
-// Please report bugs and feature requests there.
-
-#include <stddef.h>
-#include <stdint.h>
-#include <emmintrin.h>
-
-#define ASCII_MASK_1BYTE 0x80
-
-/**
- * @brief Check if a string of given length only contains ASCII characters.
- *
- * @param string A char pointer to the start of the string.
- * @param length The length of the string. This funtion does not check for
- * terminating NULL bytes.
- * @returns 1 if the string is ASCII-only, 0 otherwise.
- */
-static int
-string_is_ascii(const char * string, size_t length) {
- size_t n = length;
- const char * char_ptr = string;
- typedef __m128i longword;
- char all_chars = 0;
- longword all_words = _mm_setzero_si128();
-
- // First align the memory adress
- while ((size_t)char_ptr % sizeof(longword) && n != 0) {
- all_chars |= *char_ptr;
- char_ptr += 1;
- n -= 1;
- }
- const longword * longword_ptr = (longword *)char_ptr;
- while (n >= sizeof(longword)) {
- all_words = _mm_or_si128(all_words, *longword_ptr);
- longword_ptr += 1;
- n -= sizeof(longword);
- }
- char_ptr = (char *)longword_ptr;
- while (n != 0) {
- all_chars |= *char_ptr;
- char_ptr += 1;
- n -= 1;
- }
- // Check the most significant bits in the accumulated words and chars.
- return !(_mm_movemask_epi8(all_words) || (all_chars & ASCII_MASK_1BYTE));
-}
=====================================
src/dnaio/chunks.py
=====================================
@@ -151,15 +151,15 @@ def read_paired_chunks(
Raises:
ValueError: A FASTQ record was encountered that is larger than *buffer_size*.
"""
- if buffer_size < 1:
+ if buffer_size < 6:
raise ValueError("Buffer size too small")
buf1 = bytearray(buffer_size)
buf2 = bytearray(buffer_size)
# Read one byte to make sure we are processing FASTQ
- start1 = f.readinto(memoryview(buf1)[0:1]) # type: ignore
- start2 = f2.readinto(memoryview(buf2)[0:1]) # type: ignore
+ start1 = f.readinto(memoryview(buf1)[0:1])
+ start2 = f2.readinto(memoryview(buf2)[0:1])
if (start1 == 1 and buf1[0:1] != b"@") or (start2 == 1 and buf2[0:1] != b"@"):
raise FileFormatError(
"Paired-end data must be in FASTQ format when using multiple cores",
@@ -167,8 +167,10 @@ def read_paired_chunks(
)
while True:
- if start1 == len(buf1) or start2 == len(buf2):
- raise ValueError("FASTQ record does not fit into buffer")
+ if start1 == len(buf1) and start2 == len(buf2):
+ raise ValueError(
+ f"FASTQ records do not fit into buffer of size {buffer_size}"
+ )
bufend1 = f.readinto(memoryview(buf1)[start1:]) + start1 # type: ignore
bufend2 = f2.readinto(memoryview(buf2)[start2:]) + start2 # type: ignore
if start1 == bufend1 and start2 == bufend2:
@@ -180,6 +182,15 @@ def read_paired_chunks(
if end1 > 0 or end2 > 0:
yield (memoryview(buf1)[0:end1], memoryview(buf2)[0:end2])
+ else:
+ assert end1 == 0 and end2 == 0
+ extra = ""
+ if bufend1 == 0 or bufend2 == 0:
+ i = 1 if bufend1 == 0 else 2
+ extra = f". File {i} ended, but more data found in the other file"
+ raise FileFormatError(
+ f"Premature end of paired FASTQ input{extra}.", line=None
+ )
start1 = bufend1 - end1
assert start1 >= 0
buf1[0:start1] = buf1[end1:bufend1]
=====================================
src/dnaio/interfaces.py
=====================================
@@ -1,10 +1,11 @@
-from abc import ABC, abstractmethod
+from abc import abstractmethod
+from contextlib import AbstractContextManager
from typing import Iterable, Iterator, Tuple
from dnaio import SequenceRecord
-class SingleEndReader(ABC):
+class SingleEndReader(AbstractContextManager):
delivers_qualities: bool
number_of_records: int
@@ -21,8 +22,12 @@ class SingleEndReader(ABC):
if there was a parse error
"""
+ @abstractmethod
+ def close(self) -> None:
+ pass
+
-class PairedEndReader(ABC):
+class PairedEndReader(AbstractContextManager):
@abstractmethod
def __iter__(self) -> Iterator[Tuple[SequenceRecord, SequenceRecord]]:
"""
@@ -39,14 +44,22 @@ class PairedEndReader(ABC):
`SequenceRecord.is_mate`).
"""
+ @abstractmethod
+ def close(self) -> None:
+ pass
-class SingleEndWriter(ABC):
+
+class SingleEndWriter(AbstractContextManager):
@abstractmethod
def write(self, record: SequenceRecord) -> None:
"""Write a `SequenceRecord` to the output."""
+ @abstractmethod
+ def close(self) -> None:
+ pass
+
-class PairedEndWriter(ABC):
+class PairedEndWriter(AbstractContextManager):
@abstractmethod
def write(self, record1: SequenceRecord, record2: SequenceRecord) -> None:
"""
@@ -59,8 +72,12 @@ class PairedEndWriter(ABC):
this method.
"""
+ @abstractmethod
+ def close(self) -> None:
+ pass
+
-class MultipleFileWriter(ABC):
+class MultipleFileWriter(AbstractContextManager):
_number_of_files: int
@abstractmethod
@@ -83,3 +100,7 @@ class MultipleFileWriter(ABC):
This method may provide a speed boost over calling write for each
tuple of SequenceRecords individually.
"""
+
+ @abstractmethod
+ def close(self) -> None:
+ pass
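
Note on the interface changes above: deriving from contextlib.AbstractContextManager supplies a default __enter__ that returns self, and declaring close() abstract forces every concrete class to implement it. A minimal sketch of the pattern follows. The classes Reader and ListReader are hypothetical and are not the dnaio classes.

```python
from abc import abstractmethod
from contextlib import AbstractContextManager


class Reader(AbstractContextManager):
    """Hypothetical interface: a context manager that must define close()."""

    @abstractmethod
    def close(self) -> None:
        ...


class ListReader(Reader):
    """Hypothetical concrete reader backed by an in-memory list."""

    def __init__(self, items):
        self.items = list(items)
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def __exit__(self, *exc) -> None:
        # __enter__ is inherited and returns self; only __exit__ is needed here
        self.close()
```

Instantiating Reader directly raises TypeError because close() and __exit__ are still abstract, while ListReader can be used in a with statement and is closed when the block exits.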
=====================================
src/dnaio/multipleend.py
=====================================
@@ -58,9 +58,7 @@ class MultipleFileReader:
self._stack = contextlib.ExitStack()
self._readers: List[Union[FastaReader, FastqReader]] = [
self._stack.enter_context(
- _open_single( # type: ignore
- file, opener=opener, fileformat=fileformat, mode="r"
- )
+ _open_single(file, opener=opener, fileformat=fileformat, mode="r")
)
for file in self._files
]
@@ -145,7 +143,7 @@ class MultipleFastaWriter(MultipleFileWriter):
self._stack = contextlib.ExitStack()
self._writers: List[Union[FastaWriter, FastqWriter]] = [
self._stack.enter_context(
- _open_single( # type: ignore
+ _open_single(
file,
opener=opener,
fileformat="fasta",
=====================================
tests/test_chunks.py
=====================================
@@ -1,7 +1,7 @@
from pytest import raises
from io import BytesIO
-from dnaio import UnknownFileFormat
+from dnaio import UnknownFileFormat, FileFormatError
from dnaio._core import paired_fastq_heads
from dnaio.chunks import _fastq_head, _fasta_head, read_chunks, read_paired_chunks
@@ -74,6 +74,17 @@ def test_read_paired_chunks():
print(c1, c2)
+def test_paired_chunks_different_number_of_records():
+ record = b"@r\nAA\n+\n##\n"
+ buf1 = record
+ buf2 = record * 3
+ it = read_paired_chunks(BytesIO(buf1), BytesIO(buf2), 16)
+ assert next(it) == (record, record)
+ with raises(FileFormatError) as error:
+ next(it)
+ error.match("more data found in the other file")
+
+
def test_read_chunks():
for data in [b"@r1\nACG\n+\nHHH\n", b">r1\nACGACGACG\n"]:
assert [m.tobytes() for m in read_chunks(BytesIO(data))] == [data]
=====================================
tests/test_internal.py
=====================================
@@ -28,6 +28,8 @@ from dnaio import (
)
from dnaio.writers import FileWriter
from dnaio.readers import BinaryFileReader
+from dnaio._core import bytes_ascii_check
+
TEST_DATA = Path(__file__).parent / "data"
SIMPLE_FASTQ = str(TEST_DATA / "simple.fastq")
@@ -48,12 +50,12 @@ class TestFastaReader:
reads = list(f)
assert reads == simple_fasta
- def test_bytesio(self):
+ def test_bytesio(self) -> None:
fasta = BytesIO(b">first_sequence\nSEQUENCE1\n>second_sequence\nSEQUENCE2\n")
reads = list(FastaReader(fasta))
assert reads == simple_fasta
- def test_with_comments(self):
+ def test_with_comments(self) -> None:
fasta = BytesIO(
dedent(
"""
@@ -69,7 +71,7 @@ class TestFastaReader:
reads = list(FastaReader(fasta))
assert reads == simple_fasta
- def test_wrong_format(self):
+ def test_wrong_format(self) -> None:
fasta = BytesIO(
dedent(
"""# a comment
@@ -86,13 +88,13 @@ class TestFastaReader:
list(FastaReader(fasta))
assert info.value.line == 2
- def test_fastareader_keeplinebreaks(self):
+ def test_fastareader_keeplinebreaks(self) -> None:
with FastaReader("tests/data/simple.fasta", keep_linebreaks=True) as f:
reads = list(f)
assert reads[0] == simple_fasta[0]
assert reads[1].sequence == "SEQUEN\nCE2"
- def test_context_manager(self):
+ def test_context_manager(self) -> None:
filename = "tests/data/simple.fasta"
with open(filename, "rb") as f:
assert not f.closed
@@ -113,24 +115,24 @@ class TestFastaReader:
class TestFastqReader:
- def test_fastqreader(self):
+ def test_fastqreader(self) -> None:
with FastqReader(SIMPLE_FASTQ) as f:
reads = list(f)
assert reads == simple_fastq
@mark.parametrize("buffer_size", [1, 2, 3, 5, 7, 10, 20])
- def test_fastqreader_buffersize(self, buffer_size):
+ def test_fastqreader_buffersize(self, buffer_size) -> None:
with FastqReader("tests/data/simple.fastq", buffer_size=buffer_size) as f:
reads = list(f)
assert reads == simple_fastq
- def test_fastqreader_buffersize_too_small(self):
+ def test_fastqreader_buffersize_too_small(self) -> None:
with raises(ValueError) as e:
with FastqReader("tests/data/simple.fastq", buffer_size=0) as f:
_ = list(f) # pragma: no cover
assert "buffer size too small" in e.value.args[0]
- def test_fastqreader_dos(self):
+ def test_fastqreader_dos(self) -> None:
# DOS line breaks
with open("tests/data/dos.fastq", "rb") as f:
assert b"\r\n" in f.read()
@@ -140,13 +142,13 @@ class TestFastqReader:
unix_reads = list(f)
assert dos_reads == unix_reads
- def test_fastq_wrongformat(self):
+ def test_fastq_wrongformat(self) -> None:
with raises(FastqFormatError) as info:
with FastqReader("tests/data/withplus.fastq") as f:
list(f) # pragma: no cover
assert info.value.line == 2
- def test_empty_fastq(self):
+ def test_empty_fastq(self) -> None:
with FastqReader(BytesIO(b"")) as fq:
assert list(fq) == []
@@ -175,14 +177,14 @@ class TestFastqReader:
(b"@r1\nACG\n+\nHHH\n@r2\nT\n+\n", 7),
],
)
- def test_fastq_incomplete(self, s, line):
+ def test_fastq_incomplete(self, s, line) -> None:
fastq = BytesIO(s)
with raises(FastqFormatError) as info:
with FastqReader(fastq) as fq:
list(fq)
assert info.value.line == line
- def test_half_record_line_numbers(self):
+ def test_half_record_line_numbers(self) -> None:
fastq = BytesIO(b"@r\nACG\n+\nHH\n")
# Choose the buffer size such that only parts of the record fit
# We want to ensure that the line number is reset properly
@@ -203,21 +205,21 @@ class TestFastqReader:
(b"@r1\nACG\n+\nHHH\n@r2\nT\n+\n\n", 7),
],
)
- def test_differing_lengths(self, s, line):
+ def test_differing_lengths(self, s, line) -> None:
fastq = BytesIO(s)
with raises(FastqFormatError) as info:
with FastqReader(fastq) as fq:
list(fq)
assert info.value.line == line
- def test_missing_final_newline(self):
+ def test_missing_final_newline(self) -> None:
# Files with a missing final newline are currently allowed
fastq = BytesIO(b"@r1\nA\n+\nH")
with dnaio.open(fastq) as f:
records = list(f)
assert records == [SequenceRecord("r1", "A", "H")]
- def test_non_ascii_in_record(self):
+ def test_non_ascii_in_record(self) -> None:
# \xc4 -> Ä
fastq = BytesIO(b"@r1\n\xc4\n+\nH")
with pytest.raises(FastqFormatError) as e:
@@ -225,17 +227,19 @@ class TestFastqReader:
list(f)
e.match("Non-ASCII")
- def test_not_opened_as_binary(self):
+ def test_not_opened_as_binary(self) -> None:
filename = "tests/data/simple.fastq"
with open(filename, "rt") as f:
with raises(ValueError):
- list(dnaio.open(f))
+ list(dnaio.open(f)) # type: ignore
- def test_context_manager(self):
+ def test_context_manager(self) -> None:
filename = "tests/data/simple.fastq"
with open(filename, "rb") as f:
assert not f.closed
- _ = list(dnaio.open(f))
+ reader = dnaio.open(f)
+ assert isinstance(reader, FastqReader)
+ _ = list(reader)
assert not f.closed
assert f.closed
@@ -246,7 +250,7 @@ class TestFastqReader:
assert not sr._file.closed
assert tmp_sr._file is None
- def test_two_header_detection(self):
+ def test_two_header_detection(self) -> None:
fastq = BytesIO(b"@r1\nACG\n+r1\nHHH\n@r2\nT\n+r2\n#\n")
with FastqReader(fastq) as fq:
assert fq.two_headers
@@ -257,7 +261,7 @@ class TestFastqReader:
assert not fq.two_headers
list(fq)
- def test_second_header_not_equal(self):
+ def test_second_header_not_equal(self) -> None:
fastq = BytesIO(b"@r1\nACG\n+xy\nXXX\n")
with raises(FastqFormatError) as info:
with FastqReader(fastq) as fq:
@@ -272,7 +276,7 @@ class TestOpen:
def teardown_method(self):
shutil.rmtree(self._tmpdir)
- def test_sequence_reader(self):
+ def test_sequence_reader(self) -> None:
# test the autodetection
with dnaio.open("tests/data/simple.fastq") as f:
reads = list(f)
@@ -298,7 +302,7 @@ class TestOpen:
reads = list(dnaio.open(bio))
assert reads == simple_fasta
- def test_autodetect_fasta_format(self, tmpdir):
+ def test_autodetect_fasta_format(self, tmpdir) -> None:
path = str(tmpdir.join("tmp.fasta"))
with dnaio.open(path, mode="w") as f:
assert isinstance(f, FastaWriter)
@@ -308,7 +312,7 @@ class TestOpen:
records = list(f)
assert records == simple_fasta
- def test_write_qualities_to_fasta(self):
+ def test_write_qualities_to_fasta(self) -> None:
path = os.path.join(self._tmpdir, "tmp.fasta")
with dnaio.open(path, mode="w", qualities=True) as f:
assert isinstance(f, FastaWriter)
@@ -317,7 +321,7 @@ class TestOpen:
with dnaio.open(path) as f:
assert list(f) == simple_fasta
- def test_autodetect_fastq_format(self):
+ def test_autodetect_fastq_format(self) -> None:
path = os.path.join(self._tmpdir, "tmp.fastq")
with dnaio.open(path, mode="w") as f:
assert isinstance(f, FastqWriter)
@@ -326,7 +330,7 @@ class TestOpen:
with dnaio.open(path) as f:
assert list(f) == simple_fastq
- def test_autodetect_fastq_weird_name(self):
+ def test_autodetect_fastq_weird_name(self) -> None:
path = os.path.join(self._tmpdir, "tmp.fastq.gz")
with dnaio.open(path, mode="w") as f:
assert isinstance(f, FastqWriter)
@@ -337,7 +341,7 @@ class TestOpen:
with dnaio.open(weird_path) as f:
assert list(f) == simple_fastq
- def test_fastq_qualities_missing(self):
+ def test_fastq_qualities_missing(self) -> None:
path = os.path.join(self._tmpdir, "tmp.fastq")
with raises(ValueError):
with dnaio.open(path, mode="w", qualities=False):
@@ -345,7 +349,7 @@ class TestOpen:
class TestInterleavedReader:
- def test(self):
+ def test(self) -> None:
expected = [
(
SequenceRecord(
@@ -372,14 +376,14 @@ class TestInterleavedReader:
reads = list(f)
assert reads == expected
- def test_missing_partner(self):
+ def test_missing_partner(self) -> None:
s = BytesIO(b"@r1\nACG\n+\nHHH\n")
with raises(FileFormatError) as info:
with InterleavedPairedEndReader(s) as isr:
list(isr)
assert "Interleaved input file incomplete" in info.value.message
- def test_incorrectly_paired(self):
+ def test_incorrectly_paired(self) -> None:
s = BytesIO(b"@r1/1\nACG\n+\nHHH\n@wrong_name\nTTT\n+\nHHH\n")
with raises(FileFormatError) as info:
with InterleavedPairedEndReader(s) as isr:
@@ -395,7 +399,7 @@ class TestFastaWriter:
def teardown_method(self):
shutil.rmtree(self._tmpdir)
- def test(self):
+ def test(self) -> None:
with FastaWriter(self.path) as fw:
fw.write("name", "CCATA")
fw.write("name2", "HELLO")
@@ -403,7 +407,7 @@ class TestFastaWriter:
with open(self.path) as t:
assert t.read() == ">name\nCCATA\n>name2\nHELLO\n"
- def test_linelength(self):
+ def test_linelength(self) -> None:
with FastaWriter(self.path, line_length=3) as fw:
fw.write("r1", "ACG")
fw.write("r2", "CCAT")
@@ -413,7 +417,7 @@ class TestFastaWriter:
d = t.read()
assert d == ">r1\nACG\n>r2\nCCA\nT\n>r3\nTAC\nCAG\n"
- def test_write_sequence_object(self):
+ def test_write_sequence_object(self) -> None:
with FastaWriter(self.path) as fw:
fw.write(SequenceRecord("name", "CCATA"))
fw.write(SequenceRecord("name2", "HELLO"))
@@ -421,7 +425,7 @@ class TestFastaWriter:
with open(self.path) as t:
assert t.read() == ">name\nCCATA\n>name2\nHELLO\n"
- def test_write_to_file_like_object(self):
+ def test_write_to_file_like_object(self) -> None:
bio = BytesIO()
with FastaWriter(bio) as fw:
fw.write(SequenceRecord("name", "CCATA"))
@@ -430,7 +434,7 @@ class TestFastaWriter:
assert not bio.closed
assert not fw._file.closed
- def test_write_zero_length_sequence_record(self):
+ def test_write_zero_length_sequence_record(self) -> None:
bio = BytesIO()
with FastaWriter(bio) as fw:
fw.write(SequenceRecord("name", ""))
@@ -445,7 +449,7 @@ class TestFastqWriter:
def teardown_method(self):
shutil.rmtree(self._tmpdir)
- def test(self):
+ def test(self) -> None:
with FastqWriter(self.path) as fq:
fq.writeseq("name", "CCATA", "!#!#!")
fq.writeseq("name2", "HELLO", "&&&!&&")
@@ -453,7 +457,7 @@ class TestFastqWriter:
with open(self.path) as t:
assert t.read() == "@name\nCCATA\n+\n!#!#!\n@name2\nHELLO\n+\n&&&!&&\n"
- def test_twoheaders(self):
+ def test_twoheaders(self) -> None:
with FastqWriter(self.path, two_headers=True) as fq:
fq.write(SequenceRecord("name", "CCATA", "!#!#!"))
fq.write(SequenceRecord("name2", "HELLO", "&&&!&"))
@@ -463,7 +467,7 @@ class TestFastqWriter:
t.read() == "@name\nCCATA\n+name\n!#!#!\n@name2\nHELLO\n+name2\n&&&!&\n"
)
- def test_write_to_file_like_object(self):
+ def test_write_to_file_like_object(self) -> None:
bio = BytesIO()
with FastqWriter(bio) as fq:
fq.writeseq("name", "CCATA", "!#!#!")
@@ -472,7 +476,7 @@ class TestFastqWriter:
class TestInterleavedWriter:
- def test(self):
+ def test(self) -> None:
reads = [
(
SequenceRecord("A/1 comment", "TTA", "##H"),
@@ -493,7 +497,7 @@ class TestInterleavedWriter:
class TestPairedSequenceReader:
- def test_read(self):
+ def test_read(self) -> None:
s1 = BytesIO(b"@r1\nACG\n+\nHHH\n")
s2 = BytesIO(b"@r2\nGTT\n+\n858\n")
with TwoFilePairedEndReader(s1, s2) as psr:
@@ -504,7 +508,7 @@ class TestPairedSequenceReader:
),
] == list(psr)
- def test_record_names_match(self):
+ def test_record_names_match(self) -> None:
match = record_names_match
assert match("abc", "abc")
assert match("abc def", "abc")
@@ -517,7 +521,7 @@ class TestPairedSequenceReader:
assert match("abc\tcomments comments", "abc\tothers others")
assert match("abc\tdef", "abc def")
- def test_record_names_match_with_ignored_trailing_12(self):
+ def test_record_names_match_with_ignored_trailing_12(self) -> None:
match = record_names_match
assert match("abc/1", "abc/2")
assert match("abc.1", "abc.2")
@@ -531,13 +535,13 @@ class TestPairedSequenceReader:
assert not match("abc", "abc1")
assert not match("abc", "abc2")
- def test_record_names_match_with_ignored_trailing_123(self):
+ def test_record_names_match_with_ignored_trailing_123(self) -> None:
match = record_names_match
assert match("abc/1", "abc/3")
assert match("abc.1 def", "abc.3 ghi")
assert match("abc.3 def", "abc.1 ghi")
- def test_missing_partner1(self):
+ def test_missing_partner1(self) -> None:
s1 = BytesIO(b"")
s2 = BytesIO(b"@r1\nACG\n+\nHHH\n")
@@ -546,7 +550,7 @@ class TestPairedSequenceReader:
list(psr)
assert "There are more reads in file 2 than in file 1" in info.value.message
- def test_missing_partner2(self):
+ def test_missing_partner2(self) -> None:
s1 = BytesIO(b"@r1\nACG\n+\nHHH\n")
s2 = BytesIO(b"")
@@ -555,7 +559,7 @@ class TestPairedSequenceReader:
list(psr)
assert "There are more reads in file 1 than in file 2" in info.value.message
- def test_empty_sequences_do_not_stop_iteration(self):
+ def test_empty_sequences_do_not_stop_iteration(self) -> None:
s1 = BytesIO(b"@r1_1\nACG\n+\nHHH\n@r2_1\nACG\n+\nHHH\n@r3_2\nACG\n+\nHHH\n")
s2 = BytesIO(b"@r1_1\nACG\n+\nHHH\n@r2_2\n\n+\n\n@r3_2\nACG\n+\nHHH\n")
# Second sequence for s2 is empty but valid. Should not lead to a stop of iteration.
@@ -564,7 +568,7 @@ class TestPairedSequenceReader:
print(seqs)
assert len(seqs) == 3
- def test_incorrectly_paired(self):
+ def test_incorrectly_paired(self) -> None:
s1 = BytesIO(b"@r1/1\nACG\n+\nHHH\n")
s2 = BytesIO(b"@wrong_name\nTTT\n+\nHHH\n")
with raises(FileFormatError) as info:
@@ -582,7 +586,7 @@ class TestPairedSequenceReader:
os.path.join("tests", "data", "with_comment.fasta"),
],
)
-def test_read_stdin(path):
+def test_read_stdin(path) -> None:
# Get number of records in the input file
with dnaio.open(path) as f:
expected = len(list(f))
@@ -597,12 +601,13 @@ def test_read_stdin(path):
stdin=cat.stdout,
stdout=subprocess.PIPE,
) as py:
+ assert cat.stdout is not None
cat.stdout.close()
# Check that the read_from_stdin.py script prints the correct number of records
assert str(expected) == py.communicate()[0].decode().strip()
-def test_file_writer(tmp_path):
+def test_file_writer(tmp_path) -> None:
path = tmp_path / "out.txt"
fw = FileWriter(path)
repr(fw)
@@ -614,7 +619,7 @@ def test_file_writer(tmp_path):
assert "operation on closed file" in e.value.args[0]
-def test_binary_file_reader():
+def test_binary_file_reader() -> None:
bfr = BinaryFileReader("tests/data/simple.fasta")
repr(bfr)
bfr.close()
@@ -624,19 +629,17 @@ def test_binary_file_reader():
assert "operation on closed" in e.value.args[0]
-def test_fasta_writer_repr(tmp_path):
+def test_fasta_writer_repr(tmp_path) -> None:
with FastaWriter(tmp_path / "out.fasta") as fw:
repr(fw)
-def test_fastq_writer_repr(tmp_path):
+def test_fastq_writer_repr(tmp_path) -> None:
with FastqWriter(tmp_path / "out.fastq") as fw:
repr(fw)
class TestAsciiCheck:
- from dnaio._core import bytes_ascii_check
-
ASCII_STRING = (
"In het Nederlands komen bijzondere leestekens niet vaak voor.".encode("ascii")
)
@@ -645,25 +648,25 @@ class TestAsciiCheck:
"In späterer Zeit trat Umlaut sehr häufig analogisch ein.".encode("latin-1")
)
- def test_ascii(self):
- assert self.bytes_ascii_check(self.ASCII_STRING)
+ def test_ascii(self) -> None:
+ assert bytes_ascii_check(self.ASCII_STRING)
- def test_ascii_all_chars(self):
- assert self.bytes_ascii_check(bytes(range(128)))
- assert not self.bytes_ascii_check(bytes(range(129)))
+ def test_ascii_all_chars(self) -> None:
+ assert bytes_ascii_check(bytes(range(128)))
+ assert not bytes_ascii_check(bytes(range(129)))
- def test_non_ascii(self):
- assert not self.bytes_ascii_check(self.NON_ASCII_STRING)
+ def test_non_ascii(self) -> None:
+ assert not bytes_ascii_check(self.NON_ASCII_STRING)
- def test_non_ascii_lengths(self):
+ def test_non_ascii_lengths(self) -> None:
# Make sure that the function finds the non-ascii byte correctly for
# all lengths.
non_ascii_char = "é".encode("latin-1")
for i in range(len(self.ASCII_STRING)):
test_string = self.ASCII_STRING[:i] + non_ascii_char
- assert not self.bytes_ascii_check(test_string)
+ assert not bytes_ascii_check(test_string)
- def test_ascii_lengths(self):
+ def test_ascii_lengths(self) -> None:
# Make sure the ascii check is correct even though there are non-ASCII
# bytes directly behind the search space.
# This ensures there is no overshoot where the algorithm checks bytes
@@ -671,11 +674,11 @@ class TestAsciiCheck:
non_ascii_char = "é".encode("latin-1")
for i in range(1, len(self.ASCII_STRING) + 1):
test_string = self.ASCII_STRING[:i] + (non_ascii_char * 8)
- assert self.bytes_ascii_check(test_string, i - 1)
+ assert bytes_ascii_check(test_string, i - 1)
class TestRecordsAreMates:
- def test_records_are_mates(self):
+ def test_records_are_mates(self) -> None:
assert records_are_mates(
SequenceRecord("same_name1 some_comment", "A", "H"),
SequenceRecord("same_name2 other_comment", "A", "H"),
@@ -683,23 +686,23 @@ class TestRecordsAreMates:
)
@pytest.mark.parametrize("number_of_mates", list(range(2, 11)))
- def test_lots_of_records_are_mates(self, number_of_mates):
+ def test_lots_of_records_are_mates(self, number_of_mates) -> None:
mates = [SequenceRecord("name", "A", "H") for _ in range(number_of_mates)]
assert records_are_mates(*mates)
- def test_records_are_not_mates(self):
+ def test_records_are_not_mates(self) -> None:
assert not records_are_mates(
SequenceRecord("same_name1 some_comment", "A", "H"),
SequenceRecord("same_name2 other_comment", "A", "H"),
SequenceRecord("shame_name3 different_comment", "A", "H"),
)
- def test_records_are_mates_zero_arguments(self):
+ def test_records_are_mates_zero_arguments(self) -> None:
with pytest.raises(TypeError) as error:
- records_are_mates()
+ records_are_mates() # type: ignore
error.match("records_are_mates requires at least two arguments")
- def test_records_are_mates_one_argument(self):
+ def test_records_are_mates_one_argument(self) -> None:
with pytest.raises(TypeError) as error:
- records_are_mates(SequenceRecord("A", "A", "A"))
+ records_are_mates(SequenceRecord("A", "A", "A")) # type: ignore
error.match("records_are_mates requires at least two arguments")
=====================================
tests/test_multiple.py
=====================================
@@ -15,7 +15,7 @@ import pytest
)
def test_read_files(fileformat, number_of_files):
file = Path(__file__).parent / "data" / ("simple." + fileformat)
- files = [file for _ in range(number_of_files)]
+ files = [file] * number_of_files
with _open_multiple(*files) as multiple_reader:
for records in multiple_reader:
pass
@@ -102,7 +102,7 @@ def test_multiple_read_unmatched_names(number_of_files):
with _open_multiple(*files) as reader:
with pytest.raises(dnaio.FileFormatError) as error:
for records in reader:
- pass
+ pass # pragma: no coverage
error.match("do not match")
=====================================
tests/test_open.py
=====================================
@@ -30,34 +30,34 @@ SIMPLE_RECORDS = {
}
-def formatted_sequence(record, fileformat):
+def formatted_sequence(record, fileformat) -> str:
if fileformat == "fastq":
return "@{}\n{}\n+\n{}\n".format(record.name, record.sequence, record.qualities)
else:
return ">{}\n{}\n".format(record.name, record.sequence)
-def formatted_sequences(records, fileformat):
+def formatted_sequences(records, fileformat) -> str:
return "".join(formatted_sequence(record, fileformat) for record in records)
-def test_formatted_sequence():
+def test_formatted_sequence() -> None:
s = dnaio.SequenceRecord("s1", "ACGT", "HHHH")
assert ">s1\nACGT\n" == formatted_sequence(s, "fasta")
assert "@s1\nACGT\n+\nHHHH\n" == formatted_sequence(s, "fastq")
-def test_version():
+def test_version() -> None:
_ = dnaio.__version__
-def test_open_nonexistent(tmp_path):
+def test_open_nonexistent(tmp_path) -> None:
with pytest.raises(FileNotFoundError):
with dnaio.open(tmp_path / "nonexistent"):
pass # pragma: no cover
-def test_open_empty_file_with_unrecognized_extension(tmp_path):
+def test_open_empty_file_with_unrecognized_extension(tmp_path) -> None:
path = tmp_path / "unrecognized-extension.tmp"
path.touch()
with dnaio.open(path) as f:
@@ -65,7 +65,7 @@ def test_open_empty_file_with_unrecognized_extension(tmp_path):
assert records == []
-def test_fileformat_error(tmp_path):
+def test_fileformat_error(tmp_path) -> None:
with open(tmp_path / "file.fastq", mode="w") as f:
print("this is not a FASTQ file", file=f)
with pytest.raises(FileFormatError) as e:
@@ -74,13 +74,13 @@ def test_fileformat_error(tmp_path):
assert "at line 2" in str(e.value) # Premature end of file
-def test_write_unknown_file_format(tmp_path):
+def test_write_unknown_file_format(tmp_path) -> None:
with pytest.raises(UnknownFileFormat):
with dnaio.open(tmp_path / "out.txt", mode="w") as f:
f.write(dnaio.SequenceRecord("name", "ACG", "###")) # pragma: no cover
-def test_read_unknown_file_format(tmp_path):
+def test_read_unknown_file_format(tmp_path) -> None:
with open(tmp_path / "file.txt", mode="w") as f:
print("text file", file=f)
with pytest.raises(UnknownFileFormat):
@@ -88,13 +88,13 @@ def test_read_unknown_file_format(tmp_path):
_ = list(f) # pragma: no cover
-def test_invalid_format(tmp_path):
+def test_invalid_format(tmp_path) -> None:
with pytest.raises(UnknownFileFormat):
with dnaio.open(tmp_path / "out.txt", mode="w", fileformat="foo"):
pass # pragma: no cover
-def test_write_qualities_to_file_without_fastq_extension(tmp_path):
+def test_write_qualities_to_file_without_fastq_extension(tmp_path) -> None:
with dnaio.open(tmp_path / "out.txt", mode="w", qualities=True) as f:
f.write(dnaio.SequenceRecord("name", "ACG", "###"))
@@ -102,20 +102,20 @@ def test_write_qualities_to_file_without_fastq_extension(tmp_path):
f.write(dnaio.SequenceRecord("name", "ACG", None))
-def test_read(fileformat, extension):
+def test_read(fileformat, extension) -> None:
with dnaio.open("tests/data/simple." + fileformat + extension) as f:
records = list(f)
assert records == SIMPLE_RECORDS[fileformat]
-def test_read_pathlib_path(fileformat, extension):
+def test_read_pathlib_path(fileformat, extension) -> None:
path = Path("tests/data/simple." + fileformat + extension)
with dnaio.open(path) as f:
records = list(f)
assert records == SIMPLE_RECORDS[fileformat]
-def test_read_opener(fileformat, extension):
+def test_read_opener(fileformat, extension) -> None:
def my_opener(path, mode):
import io
@@ -134,14 +134,14 @@ def test_read_opener(fileformat, extension):
assert records[0].sequence == "ACG"
-def test_read_paired_fasta():
+def test_read_paired_fasta() -> None:
path = "tests/data/simple.fasta"
- with dnaio.open(file1=path, file2=path) as f:
+ with dnaio.open(path, path) as f:
list(f)
@pytest.mark.parametrize("interleaved", [False, True])
-def test_paired_opener(fileformat, extension, interleaved):
+def test_paired_opener(fileformat, extension, interleaved) -> None:
def my_opener(_path, _mode):
import io
@@ -154,7 +154,7 @@ def test_paired_opener(fileformat, extension, interleaved):
path1 = "ignored-filename." + fileformat + extension
path2 = "also-ignored-filename." + fileformat + extension
if interleaved:
- with dnaio.open(path1, file2=path2, opener=my_opener) as f:
+ with dnaio.open(path1, path2, opener=my_opener) as f:
records = list(f)
expected = 2
else:
@@ -168,21 +168,21 @@ def test_paired_opener(fileformat, extension, interleaved):
assert records[0][1].sequence == "ACG"
-def test_detect_fastq_from_content():
+def test_detect_fastq_from_content() -> None:
"""FASTQ file that is not named .fastq"""
with dnaio.open("tests/data/missingextension") as f:
record = next(iter(f))
assert record.name == "prefix:1_13_573/1"
-def test_detect_compressed_fastq_from_content():
+def test_detect_compressed_fastq_from_content() -> None:
"""Compressed FASTQ file that is not named .fastq.gz"""
with dnaio.open("tests/data/missingextension.gz") as f:
record = next(iter(f))
assert record.name == "prefix:1_13_573/1"
-def test_write(tmp_path, extension):
+def test_write(tmp_path, extension) -> None:
out_fastq = tmp_path / ("out.fastq" + extension)
with dnaio.open(str(out_fastq), mode="w") as f:
f.write(dnaio.SequenceRecord("name", "ACGT", "HHHH"))
@@ -190,7 +190,7 @@ def test_write(tmp_path, extension):
assert f.read() == "@name\nACGT\n+\nHHHH\n"
-def test_write_with_xopen(tmp_path, fileformat, extension):
+def test_write_with_xopen(tmp_path, fileformat, extension) -> None:
s = dnaio.SequenceRecord("name", "ACGT", "HHHH")
out_fastq = tmp_path / ("out." + fileformat + extension)
with xopen(out_fastq, "wb") as outer_f:
@@ -204,7 +204,7 @@ def test_write_with_xopen(tmp_path, fileformat, extension):
assert f.read() == "@name\nACGT\n+\nHHHH\n"
-def test_write_str_path(tmp_path, fileformat, extension):
+def test_write_str_path(tmp_path, fileformat, extension) -> None:
s1 = dnaio.SequenceRecord("s1", "ACGT", "HHHH")
path = str(tmp_path / ("out." + fileformat + extension))
with dnaio.open(path, mode="w") as f:
@@ -217,15 +217,15 @@ def test_write_str_path(tmp_path, fileformat, extension):
assert f.read() == expected
-def test_write_paired_same_path(tmp_path):
+def test_write_paired_same_path(tmp_path) -> None:
path1 = tmp_path / "same.fastq"
path2 = tmp_path / "same.fastq"
with pytest.raises(ValueError):
- with dnaio.open(file1=path1, file2=path2, mode="w"):
+ with dnaio.open(path1, path2, mode="w"):
pass # pragma: no cover
-def test_write_paired(tmp_path, fileformat, extension):
+def test_write_paired(tmp_path, fileformat, extension) -> None:
r1 = [
dnaio.SequenceRecord("s1", "ACGT", "HHHH"),
dnaio.SequenceRecord("s2", "CGCA", "8383"),
@@ -237,7 +237,7 @@ def test_write_paired(tmp_path, fileformat, extension):
path1 = tmp_path / ("out.1." + fileformat + extension)
path2 = tmp_path / ("out.2." + fileformat + extension)
- with dnaio.open(path1, file2=path2, fileformat=fileformat, mode="w") as f:
+ with dnaio.open(path1, path2, fileformat=fileformat, mode="w") as f:
f.write(r1[0], r2[0])
f.write(r1[1], r2[1])
with xopen(path1) as f:
@@ -246,7 +246,7 @@ def test_write_paired(tmp_path, fileformat, extension):
assert formatted_sequences(r2, fileformat) == f.read()
-def test_write_interleaved(tmp_path, fileformat, extension):
+def test_write_interleaved(tmp_path, fileformat, extension) -> None:
r1 = [
dnaio.SequenceRecord("s1", "ACGT", "HHHH"),
dnaio.SequenceRecord("s2", "CGCA", "8383"),
@@ -265,7 +265,7 @@ def test_write_interleaved(tmp_path, fileformat, extension):
assert formatted_sequences(expected, fileformat) == f.read()
-def test_append(tmp_path, fileformat, extension):
+def test_append(tmp_path, fileformat, extension) -> None:
s1 = dnaio.SequenceRecord("s1", "ACGT", "HHHH")
s2 = dnaio.SequenceRecord("s2", "CGCA", "8383")
path = tmp_path / ("out." + fileformat + extension)
@@ -277,7 +277,7 @@ def test_append(tmp_path, fileformat, extension):
assert formatted_sequences([s1, s2], fileformat) == f.read()
-def test_append_paired(tmp_path, fileformat, extension):
+def test_append_paired(tmp_path, fileformat, extension) -> None:
r1 = [
dnaio.SequenceRecord("s1", "ACGT", "HHHH"),
dnaio.SequenceRecord("s2", "CGCA", "8383"),
@@ -289,9 +289,9 @@ def test_append_paired(tmp_path, fileformat, extension):
path1 = tmp_path / ("out.1." + fileformat + extension)
path2 = tmp_path / ("out.2." + fileformat + extension)
- with dnaio.open(path1, file2=path2, fileformat=fileformat, mode="w") as f:
+ with dnaio.open(path1, path2, fileformat=fileformat, mode="w") as f:
f.write(r1[0], r2[0])
- with dnaio.open(path1, file2=path2, fileformat=fileformat, mode="a") as f:
+ with dnaio.open(path1, path2, fileformat=fileformat, mode="a") as f:
f.write(r1[1], r2[1])
with xopen(path1) as f:
assert formatted_sequences(r1, fileformat) == f.read()
@@ -299,7 +299,7 @@ def test_append_paired(tmp_path, fileformat, extension):
assert formatted_sequences(r2, fileformat) == f.read()
-def test_append_interleaved(tmp_path, fileformat, extension):
+def test_append_interleaved(tmp_path, fileformat, extension) -> None:
r1 = [
dnaio.SequenceRecord("s1", "ACGT", "HHHH"),
dnaio.SequenceRecord("s2", "CGCA", "8383"),
@@ -319,7 +319,7 @@ def test_append_interleaved(tmp_path, fileformat, extension):
assert formatted_sequences(expected, fileformat) == f.read()
-def make_random_fasta(path, n_records):
+def make_random_fasta(path, n_records) -> None:
from random import choice
with xopen(path, "w") as f:
@@ -329,7 +329,7 @@ def make_random_fasta(path, n_records):
print(">", name, "\n", sequence, sep="", file=f)
-def test_islice_gzip_does_not_fail(tmp_path):
+def test_islice_gzip_does_not_fail(tmp_path) -> None:
path = tmp_path / "file.fasta.gz"
make_random_fasta(path, 100)
f = dnaio.open(path)
@@ -337,22 +337,22 @@ def test_islice_gzip_does_not_fail(tmp_path):
f.close()
-def test_unsupported_mode():
+def test_unsupported_mode() -> None:
with pytest.raises(ValueError) as error:
- _ = dnaio.open(os.devnull, mode="x")
+ _ = dnaio.open(os.devnull, mode="x") # type: ignore
error.match("Mode must be")
-def test_no_file2_with_multiple_args():
+def test_no_file2_with_multiple_args() -> None:
with pytest.raises(ValueError) as error:
- _ = dnaio.open(os.devnull, os.devnull, file2=os.devnull)
+ _ = dnaio.open(os.devnull, os.devnull, file2=os.devnull) # type: ignore
error.match("as positional argument")
error.match("file2")
-def test_no_multiple_files_interleaved():
+def test_no_multiple_files_interleaved() -> None:
with pytest.raises(ValueError) as error:
- _ = dnaio.open(os.devnull, os.devnull, interleaved=True)
+ _ = dnaio.open(os.devnull, os.devnull, interleaved=True) # type: ignore
error.match("interleaved")
error.match("one file")
@@ -361,7 +361,9 @@ def test_no_multiple_files_interleaved():
["mode", "expected_class"],
[("r", dnaio.PairedEndReader), ("w", dnaio.PairedEndWriter)],
)
-def test_paired_open_with_multiple_args(tmp_path, fileformat, mode, expected_class):
+def test_paired_open_with_multiple_args(
+ tmp_path, fileformat, mode, expected_class
+) -> None:
path = tmp_path / "file"
path2 = tmp_path / "file2"
path.touch()
@@ -379,6 +381,36 @@ def test_paired_open_with_multiple_args(tmp_path, fileformat, mode, expected_cla
({"mode": "w", "fileformat": "fasta"}, dnaio.multipleend.MultipleFastaWriter),
],
)
-def test_multiple_open_fastq(kwargs, expected_class):
+def test_multiple_open_fastq(kwargs, expected_class) -> None:
with dnaio.open(os.devnull, os.devnull, os.devnull, **kwargs) as f:
assert isinstance(f, expected_class)
+
+
+def test_deprecated_file1_file2_keyword_arguments(tmp_path):
+ path = Path("tests/data/simple.fasta")
+ expected = SIMPLE_RECORDS["fasta"]
+ with dnaio.open(file1=path) as f:
+ records = list(f)
+ assert records == expected
+
+ with dnaio.open(path, file2=path) as f:
+ records = list(f)
+ assert records == list(zip(expected, expected))
+
+ with dnaio.open(file1=path, file2=path) as f:
+ records = list(f)
+ assert records == list(zip(expected, expected))
+
+
+def test_positional_with_file1():
+ with pytest.raises(ValueError) as error:
+ with dnaio.open("in.fastq", file1="in2.fastq"):
+ pass # pragma: no cover
+ error.match("file1 keyword argument cannot be used together")
+
+
+def test_positional_with_file1_and_file2():
+ with pytest.raises(ValueError) as error:
+ with dnaio.open("in.fastq", file1="in2.fastq", file2="in3.fastq"):
+ pass # pragma: no cover
+ error.match("cannot be used together")
=====================================
tests/test_records.py
=====================================
@@ -105,6 +105,54 @@ class TestSequenceRecord:
seq.qualities = None
assert seq.qualities is None
+ def test_set_id(self):
+ seq = SequenceRecord("name", "A", "=")
+ with pytest.raises(AttributeError):
+ seq.id = "Obi-Wan"
+
+ def test_set_comment(self):
+ seq = SequenceRecord("name", "A", "=")
+ with pytest.raises(AttributeError):
+ seq.comment = "Hello there!"
+
+ @pytest.mark.parametrize(
+ ["record", "expected"],
+ [
+ (SequenceRecord("name", "A", "="), None),
+ (SequenceRecord("name ", "A", "="), None),
+ (SequenceRecord("name ", "A", "="), None),
+ (SequenceRecord("name", "A", "="), None),
+ (SequenceRecord("AotC I hate sand!", "A", "="), "I hate sand!"),
+ (
+ SequenceRecord("Givemesome space", "A", "="),
+ "space",
+ ),
+ ],
+ )
+ def test_get_comment(self, record, expected):
+ assert record.comment == expected
+
+ @pytest.mark.parametrize(
+ ["record", "expected"],
+ [
+ (SequenceRecord("name", "A", "="), "name"),
+ (SequenceRecord("name ", "A", "="), "name"),
+ (SequenceRecord("name ", "A", "="), "name"),
+ (SequenceRecord("name", "A", "="), "name"),
+ (SequenceRecord("AotC I hate sand!", "A", "="), "AotC"),
+ ],
+ )
+ def test_get_id(self, record, expected):
+ assert record.id == expected
+
+ def test_reset_id_and_comment_on_name_update(self):
+ record = SequenceRecord("Obi-Wan: don't try it!", "", "")
+ assert record.id == "Obi-Wan:"
+ assert record.comment == "don't try it!"
+ record.name = "Anakin: you underestimate my power!"
+ assert record.id == "Anakin:"
+ assert record.comment == "you underestimate my power!"
+
def test_legacy_sequence():
from dnaio import Sequence
=====================================
tox.ini
=====================================
@@ -1,5 +1,5 @@
[tox]
-envlist = flake8,black,mypy,docs,py37,py38,py39,py310,py311
+envlist = flake8,black,mypy,docs,py38,py39,py310,py311,py312
isolated_build = True
[testenv]
@@ -14,45 +14,29 @@ commands =
setenv = PYTHONDEVMODE = 1
[testenv:flake8]
-basepython = python3.7
+basepython = python3.10
deps = flake8
commands = flake8 src/ tests/
[testenv:black]
-basepython = python3.7
+basepython = python3.10
deps = black==22.3.0
skip_install = true
commands = black --check src/ tests/ doc/ helpers/ setup.py
[testenv:mypy]
-basepython = python3.7
-deps = mypy
-commands = mypy src/
+basepython = python3.10
+deps =
+ mypy
+ pytest
+commands = mypy src/ tests/
[testenv:docs]
-basepython = python3.7
+basepython = python3.10
changedir = doc
deps = -r doc/requirements.txt
commands = sphinx-build -W -d {envtmpdir}/doctrees . {envtmpdir}/html
-[coverage:run]
-branch = True
-parallel = True
-include =
- */site-packages/dnaio/*
- tests/*
-
-[coverage:paths]
-source =
- src/
- */site-packages/
-
-[coverage:report]
-precision = 1
-exclude_lines =
- pragma: no cover
- def __repr__
-
[flake8]
max-line-length = 99
max-complexity = 15
View it on GitLab: https://salsa.debian.org/med-team/python-dnaio/-/commit/7d3432057b2878339b6a620c4762002829fda35c