[med-svn] [Git][med-team/python-dnaio][master] 4 commits: New upstream version 0.10.0
Nilesh Patra (@nilesh)
gitlab@salsa.debian.org
Sat Dec 31 11:40:18 GMT 2022
Nilesh Patra pushed to branch master at Debian Med / python-dnaio
Commits:
9a606fda by Nilesh Patra at 2022-12-31T17:03:50+05:30
New upstream version 0.10.0
- - - - -
ecc4f29e by Nilesh Patra at 2022-12-31T17:03:50+05:30
Update upstream source from tag 'upstream/0.10.0'
Update to upstream version '0.10.0'
with Debian dir 7c21dae8da332ca6b952e1452395dedef7c163ab
- - - - -
142d2736 by Nilesh Patra at 2022-12-31T17:04:03+05:30
Bump Standards-Version to 4.6.2 (no changes needed)
- - - - -
5855df2a by Nilesh Patra at 2022-12-31T17:05:16+05:30
Upload to unstable
- - - - -
22 changed files:
- − .github/workflows/ci.yml
- CHANGES.rst
- debian/changelog
- debian/control
- doc/api.rst
- doc/tutorial.rst
- pyproject.toml
- src/dnaio/__init__.py
- src/dnaio/_core.pyi
- src/dnaio/_core.pyx
- src/dnaio/_util.py
- src/dnaio/interfaces.py
- + src/dnaio/multipleend.py
- src/dnaio/pairedend.py
- src/dnaio/readers.py
- src/dnaio/singleend.py
- src/dnaio/writers.py
- tests/test_chunks.py
- tests/test_internal.py
- + tests/test_multiple.py
- tests/test_open.py
- tox.ini
Changes:
=====================================
.github/workflows/ci.yml deleted
=====================================
@@ -1,107 +0,0 @@
-name: CI
-
-on: [push, pull_request]
-
-jobs:
- lint:
- timeout-minutes: 10
- runs-on: ubuntu-latest
- strategy:
- matrix:
- python-version: [3.7]
- toxenv: [flake8, black, mypy, docs]
- steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
- with:
- python-version: ${{ matrix.python-version }}
- - name: Install tox
- run: python -m pip install tox
- - name: Run tox ${{ matrix.toxenv }}
- run: tox -e ${{ matrix.toxenv }}
-
- build:
- runs-on: ubuntu-latest
- steps:
- uses: actions/checkout@v2
- with:
- fetch-depth: 0 # required for setuptools_scm
- - name: Build sdist and temporary wheel
- run: pipx run build
- uses: actions/upload-artifact@v2
- with:
- name: sdist
- path: dist/*.tar.gz
-
- test:
- timeout-minutes: 10
- runs-on: ${{ matrix.os }}
- strategy:
- matrix:
- os: [ubuntu-latest]
- python-version: ["3.7", "3.8", "3.9", "3.10"]
- include:
- - os: macos-latest
- python-version: 3.8
- - os: windows-latest
- python-version: 3.8
- steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
- with:
- python-version: ${{ matrix.python-version }}
- - name: Install tox
- run: python -m pip install tox
- - name: Test
- run: tox -e py
- - name: Upload coverage report
uses: codecov/codecov-action@v1
-
- wheels:
- if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
- needs: [lint, test]
- timeout-minutes: 15
- strategy:
- matrix:
- os: [ubuntu-20.04, windows-2019, macos-10.15]
- runs-on: ${{ matrix.os }}
- steps:
- uses: actions/checkout@v2
- with:
- fetch-depth: 0 # required for setuptools_scm
- - name: Build wheels
uses: pypa/cibuildwheel@v2.3.1
- env:
- CIBW_BUILD: "cp*-manylinux_x86_64 cp3*-win_amd64 cp3*-macosx_x86_64"
- CIBW_ENVIRONMENT: "CFLAGS=-g0"
- CIBW_TEST_REQUIRES: "pytest"
- CIBW_TEST_COMMAND_LINUX: "cd {project} && pytest tests"
- CIBW_TEST_COMMAND_MACOS: "cd {project} && pytest tests"
- CIBW_TEST_COMMAND_WINDOWS: "cd /d {project} && pytest tests"
- uses: actions/upload-artifact@v2
- with:
- name: wheels
- path: wheelhouse/*.whl
-
- publish:
- if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
- needs: [build, wheels]
- runs-on: ubuntu-latest
- steps:
- uses: actions/download-artifact@v2
- with:
- name: sdist
- path: dist/
- uses: actions/download-artifact@v2
- with:
- name: wheels
- path: dist/
- - name: Publish to PyPI
uses: pypa/gh-action-pypi-publish@v1.4.2
- with:
- user: __token__
- password: ${{ secrets.pypi_password }}
- #password: ${{ secrets.test_pypi_password }}
- #repository_url: https://test.pypi.org/legacy/
=====================================
CHANGES.rst
=====================================
@@ -2,6 +2,18 @@
Changelog
=========
+v0.10.0 (2022-12-05)
+--------------------
+
+* :pr:`99`: SequenceRecord initialization is now faster, which also provides
+ a speed boost to FASTQ iteration. ``SequenceRecord.__new__`` cannot be used
+ anymore to initialize `SequenceRecord` objects.
+* :pr:`96`: ``open_threads`` and ``compression_level`` are now added
+ to `~dnaio.open` as arguments. By default dnaio now uses compression level
+ 1 and does not utilize external programs to speed up gzip (de)compression.
+* :pr:`87`: `~dnaio.open` can now open more than two files.
+ The ``file1`` and ``file2`` arguments are now deprecated.
+
v0.9.1 (2022-08-01)
-------------------
@@ -12,7 +24,7 @@ v0.9.1 (2022-08-01)
v0.9.0 (2022-05-17)
-------------------
-* :pr:`79`: Added a `records_are_mates` function to be used for checking whether
+* :pr:`79`: Added a `~dnaio.records_are_mates` function to be used for checking whether
three or more records are mates of each other (by checking the ID).
* :pr:`74`, :pr:`68`: Made FASTQ parsing faster by implementing the check for
ASCII using SSE vector instructions.
@@ -23,12 +35,13 @@ v0.8.0 (2022-03-26)
* Preliminary documentation is available at
<https://dnaio.readthedocs.io/>.
-* :pr:`53`: Renamed ``Sequence`` to `SequenceRecord`.
+* :pr:`53`: Renamed ``Sequence`` to `~dnaio.SequenceRecord`.
The previous name is still available as an alias
so that existing code will continue to work.
* When reading a FASTQ file, there is now a check that ensures that
all characters are ASCII.
-* Function ``record_names_match`` is deprecated, use `SequenceRecord.is_mate` instead.
+* Function ``record_names_match`` is deprecated, use `~dnaio.SequenceRecord.is_mate` instead.
+* Added `~dnaio.SequenceRecord.reverse_complement`.
* Dropped Python 3.6 support as it is end-of-life.
v0.7.1 (2022-01-26)
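For context on the compression change noted in the v0.10.0 entry: level 1 favours write speed over file size. A stdlib-only sketch of that tradeoff (not dnaio code; ``write_gz`` is a hypothetical helper):

```python
import gzip
import io

def write_gz(data: bytes, level: int = 1) -> bytes:
    """Compress data with the given gzip level (1 = fastest, dnaio's new default)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=level) as f:
        f.write(data)
    return buf.getvalue()

# Level 1 writes faster; level 9 usually yields smaller output for
# repetitive data such as sequencing reads.
payload = b"ACGTACGTTTGGCCAA" * 4096
fast = write_gz(payload, level=1)
small = write_gz(payload, level=9)
```

Both outputs round-trip through ``gzip.decompress``; the difference is only size versus CPU time.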
=====================================
debian/changelog
=====================================
@@ -1,8 +1,14 @@
-python-dnaio (0.9.1-2) UNRELEASED; urgency=medium
+python-dnaio (0.10.0-1) unstable; urgency=medium
+ * Team Upload.
+ [ Andreas Tille ]
* d/watch: Proper upstream tarball name
- -- Andreas Tille <tille@debian.org> Fri, 26 Aug 2022 08:13:03 +0200
+ [ Nilesh Patra ]
+ * New upstream version 0.10.0
+ * Bump Standards-Version to 4.6.2 (no changes needed)
+
+ -- Nilesh Patra <nilesh@debian.org> Sat, 31 Dec 2022 17:05:08 +0530
python-dnaio (0.9.1-1) unstable; urgency=medium
=====================================
debian/control
=====================================
@@ -11,7 +11,7 @@ Build-Depends: debhelper-compat (= 13),
python3-pytest,
python3-xopen,
cython3
-Standards-Version: 4.6.1
+Standards-Version: 4.6.2
Vcs-Browser: https://salsa.debian.org/med-team/python-dnaio
Vcs-Git: https://salsa.debian.org/med-team/python-dnaio.git
Homepage: https://github.com/marcelm/dnaio
=====================================
doc/api.rst
=====================================
@@ -36,6 +36,9 @@ Reader and writer interfaces
.. autoclass:: PairedEndWriter
:members: write
+.. autoclass:: MultipleFileWriter
+ :members: write, write_iterable
+
Reader and writer classes
-------------------------
@@ -68,6 +71,15 @@ They can also be used directly if needed.
.. autoclass:: InterleavedPairedEndWriter
:show-inheritance:
+.. autoclass:: MultipleFileReader
+ :members: __iter__
+
+.. autoclass:: MultipleFastaWriter
+ :show-inheritance:
+
+.. autoclass:: MultipleFastqWriter
+ :show-inheritance:
+
Chunked reading of sequence records
-----------------------------------
=====================================
doc/tutorial.rst
=====================================
@@ -61,9 +61,8 @@ pass the ``mode="w"`` argument to ``dnaio.open``::
Here, a `~dnaio.FastqWriter` object is returned by ``dnaio.open``,
which has a `~dnaio.FastqWriter.write()` method that accepts a ``SequenceRecord``.
-Instead of constructing a single record from scratch,
-in practice it is more realistic to take input reads,
-process them, and write them to a new output file.
+A possibly more common use case is to read an input file,
+modify the reads and write them to a new output file.
The following example program shows how that can be done.
It truncates all reads in the input file to a length of 30 nt
and writes them to another file::
@@ -83,23 +82,22 @@ trimmed to the first 30 characters, leaving the name unchanged.
Paired-end data
---------------
-Paired-end data is supported in two forms:
-Either as a single file that contains the read in an interleaved form (R1, R2, R1, R2, ...)
-or as two separate files. To read from separate files, provide the ``file2=`` argument
-with the name of the second file to ``dnaio.open``::
+Paired-end data is supported in two forms: Two separate files or interleaved.
+
+To read from separate files, provide two input file names to the ``dnaio.open``
+function::
import dnaio
- with dnaio.open("reads.1.fastq.gz", file2="reads.2.fastq.gz") as reader:
+ with dnaio.open("reads.1.fastq.gz", "reads.2.fastq.gz") as reader:
bp = 0
for r1, r2 in reader:
bp += len(r1) + len(r2)
print(f"The paired-end input contains {bp/1E6:.1f} Mbp")
-Note that ``file2`` is a keyword-only argument, so you need to write the ``file2=`` part.
-In this example, ``dnaio.open`` returns a `~dnaio.TwoFilePairedEndReader`.
-It also supports iteration, but instead of a single ``SequenceRecord``,
-it returns a pair of them.
+Here, ``dnaio.open`` returns a `~dnaio.TwoFilePairedEndReader`.
+It also supports iteration, but instead of a plain ``SequenceRecord``,
+it returns a tuple of two ``SequenceRecord`` instances.
To read from interleaved paired-end data,
pass ``interleaved=True`` to ``dnaio.open`` instead of a second file name::
@@ -122,7 +120,7 @@ to R2::
import dnaio
with dnaio.open("in.fastq.gz") as reader, \
- dnaio.open("out.1.fastq.gz", file2="out.2.fastq.gz", mode="w") as writer:
+ dnaio.open("out.1.fastq.gz", "out.2.fastq.gz", mode="w") as writer:
for record in reader:
r1 = record[:30]
r2 = record[-30:]
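The tuple iteration described in the tutorial behaves like Python's built-in ``zip`` over two record streams. A toy stand-in with plain ``(name, sequence)`` tuples and made-up read names:

```python
# In-memory stand-ins for two FASTQ readers: (name, sequence) tuples.
r1_reads = [("read1/1", "ACGTACGT"), ("read2/1", "TTGGTTGG")]
r2_reads = [("read1/2", "CCAACCAA"), ("read2/2", "GGTTGGTT")]

# Mirrors the paired-end loop in the tutorial: each iteration yields
# one record from each input.
bp = 0
for (name1, seq1), (name2, seq2) in zip(r1_reads, r2_reads):
    bp += len(seq1) + len(seq2)
```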
=====================================
pyproject.toml
=====================================
@@ -7,3 +7,12 @@ write_to = "src/dnaio/_version.py"
[tool.pytest.ini_options]
testpaths = ["tests"]
+
+[tool.cibuildwheel]
+environment = "CFLAGS=-g0"
+test-requires = "pytest"
+test-command = ["cd {project}", "pytest tests"]
+
+[[tool.cibuildwheel.overrides]]
+select = "*-win*"
+test-command = ["cd /d {project}", "pytest tests"]
=====================================
src/dnaio/__init__.py
=====================================
@@ -21,12 +21,16 @@ __all__ = [
"InterleavedPairedEndWriter",
"TwoFilePairedEndReader",
"TwoFilePairedEndWriter",
+ "MultipleFileReader",
+ "MultipleFastaWriter",
+ "MultipleFastqWriter",
"read_chunks",
"read_paired_chunks",
"records_are_mates",
"__version__",
]
+import functools
from os import PathLike
from typing import Optional, Union, BinaryIO
@@ -47,6 +51,13 @@ from .pairedend import (
InterleavedPairedEndReader,
InterleavedPairedEndWriter,
)
+from .multipleend import (
+ MultipleFastaWriter,
+ MultipleFastqWriter,
+ MultipleFileReader,
+ MultipleFileWriter,
+ _open_multiple,
+)
from .exceptions import (
UnknownFileFormat,
FileFormatError,
@@ -63,40 +74,50 @@ from .chunks import read_chunks, read_paired_chunks
from ._version import version as __version__
-# Backwards compatibility aliases
+# Backwards compatibility alias
Sequence = SequenceRecord
def open(
- file1: Union[str, PathLike, BinaryIO],
- *,
+ *files: Union[str, PathLike, BinaryIO],
+ file1: Optional[Union[str, PathLike, BinaryIO]] = None,
file2: Optional[Union[str, PathLike, BinaryIO]] = None,
fileformat: Optional[str] = None,
interleaved: bool = False,
mode: str = "r",
qualities: Optional[bool] = None,
- opener=xopen
-) -> Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter]:
+ opener=xopen,
+ compression_level: int = 1,
+ open_threads: int = 0,
+) -> Union[
+ SingleEndReader,
+ PairedEndReader,
+ SingleEndWriter,
+ PairedEndWriter,
+ MultipleFileReader,
+ MultipleFileWriter,
+]:
"""
- Open one or two files in FASTA or FASTQ format for reading or writing.
+ Open one or more FASTQ or FASTA files for reading or writing.
Parameters:
+ files:
+ one or more Path or open file-like objects. One for single-end
+ reads, two for paired-end reads etc. More than two files are also
+ supported. At least one file is required.
file1:
- Path or an open file-like object. For reading single-end reads, this is
- the only required argument.
+ Deprecated keyword argument for the first file.
file2:
- Path or an open file-like object. When reading paired-end reads from
- two files, set this to the second file.
+ Deprecated keyword argument for the second file.
mode:
- Either ``'r'`` or ``'rb'`` for reading, ``'w'`` for writing
- or ``'a'`` for appending.
+ Set to ``'r'`` for reading, ``'w'`` for writing or ``'a'`` for appending.
interleaved:
- If True, then file1 contains interleaved paired-end data.
- file2 must be None in this case.
+ If True, then there must be only one file argument that contains
+ interleaved paired-end data.
fileformat:
If *None*, the file format is autodetected from the file name
@@ -114,29 +135,68 @@ def open(
- When False (no qualities available), an exception is raised when the
auto-detected output format is FASTQ.
- opener: A function that is used to open file1 and file2 if they are not
+ opener: A function that is used to open the files if they are not
already open file-like objects. By default, ``xopen`` is used, which can
also open compressed file formats.
- Return:
- A subclass of `SingleEndReader`, `PairedEndReader`, `SingleEndWriter` or
- `PairedEndWriter`.
+ open_threads: By default, dnaio opens files in the main thread.
+ When open_threads is greater than 0, external processes are opened for
+ compressing and decompressing files. This decreases wall clock time
+ at the cost of a little extra overhead. This parameter does not work
+ when a custom opener is set.
+
+ compression_level: By default dnaio uses compression level 1 for writing
+ gzipped files as this is the fastest. A higher level can be set using
+ this parameter. This parameter does not work when a custom opener is
+ set.
"""
- if mode not in ("r", "rb", "w", "a"):
- raise ValueError("Mode must be 'r', 'rb', 'w' or 'a'")
- elif interleaved and file2 is not None:
- raise ValueError("When interleaved is True, file2 must be None")
- elif interleaved or file2 is not None:
+ if files and (file1 is not None):
+ raise ValueError(
+ "The file1 keyword argument cannot be used together with files specified"
+ " as positional arguments"
+ )
+ elif len(files) > 1 and file2 is not None:
+ raise ValueError(
+ "The file2 argument cannot be used together with more than one "
+ "file specified as positional argument"
+ )
+ elif file1 is not None and file2 is not None and files:
+ raise ValueError(
+ "file1 and file2 arguments cannot be used together with files specified"
+ " as positional arguments"
+ )
+ elif file1 is not None and file2 is not None:
+ files = (file1, file2)
+ elif file2 is not None and len(files) == 1:
+ files = (files[0], file2)
+
+ if len(files) > 1 and interleaved:
+ raise ValueError("When interleaved is True, only one file must be specified.")
+ elif mode not in ("r", "w", "a"):
+ raise ValueError("Mode must be 'r', 'w' or 'a'")
+
+ if opener == xopen:
+ opener = functools.partial(
+ xopen, threads=open_threads, compresslevel=compression_level
+ )
+ if interleaved or len(files) == 2:
return _open_paired(
- file1,
- file2=file2,
+ *files,
opener=opener,
fileformat=fileformat,
- interleaved=interleaved,
mode=mode,
qualities=qualities,
)
+ elif len(files) > 2:
+ return _open_multiple(
+ *files, fileformat=fileformat, mode=mode, qualities=qualities, opener=opener
+ )
+
else:
return _open_single(
- file1, opener=opener, fileformat=fileformat, mode=mode, qualities=qualities
+ files[0],
+ opener=opener,
+ fileformat=fileformat,
+ mode=mode,
+ qualities=qualities,
)
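The precedence checks above for the deprecated ``file1``/``file2`` keywords can be condensed into a plain-Python sketch (``normalize_files`` is a hypothetical helper, simplified from the real error handling):

```python
def normalize_files(files, file1=None, file2=None):
    # Fold the deprecated file1/file2 keywords into the positional tuple,
    # rejecting ambiguous combinations (simplified from dnaio.open).
    if files and file1 is not None:
        raise ValueError("file1 cannot be combined with positional files")
    if len(files) > 1 and file2 is not None:
        raise ValueError("file2 cannot be combined with multiple positional files")
    if file1 is not None and file2 is not None:
        return (file1, file2)
    if file2 is not None and len(files) == 1:
        return (files[0], file2)
    if file1 is not None:
        return (file1,)
    return files
```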
=====================================
src/dnaio/_core.pyi
=====================================
@@ -1,4 +1,13 @@
-from typing import Optional, Tuple, BinaryIO, Iterator, Type, TypeVar, ByteString
+from typing import (
+ Generic,
+ Optional,
+ Tuple,
+ BinaryIO,
+ Iterator,
+ Type,
+ TypeVar,
+ ByteString,
+)
class SequenceRecord:
name: str
@@ -31,14 +40,14 @@ def records_are_mates(
T = TypeVar("T")
-class FastqIter:
+class FastqIter(Generic[T]):
def __init__(
self, file: BinaryIO, sequence_class: Type[T], buffer_size: int = ...
): ...
def __iter__(self) -> Iterator[T]: ...
def __next__(self) -> T: ...
@property
- def n_records(self) -> int: ...
+ def number_of_records(self) -> int: ...
# Deprecated
def record_names_match(header1: str, header2: str) -> bool: ...
=====================================
src/dnaio/_core.pyx
=====================================
@@ -41,6 +41,24 @@ def bytes_ascii_check(bytes string, Py_ssize_t length = -1):
return ascii
+def is_not_ascii_message(field, value):
+ """
+ Return an error message for a non-ASCII field encountered when initializing a SequenceRecord
+
+ Arguments:
+ field: Description of the field ("name", "sequence", "qualities" or similar)
+ in which non-ASCII characters were found
+ value: Unicode string that was intended to be assigned to the field
+ """
+ detail = ""
+ try:
+ value.encode("ascii")
+ except UnicodeEncodeError as e:
+ detail = (
+ f", but found '{value[e.start:e.end]}' at index {e.start}"
+ )
+ return f"'{field}' in sequence file must be ASCII encoded{detail}"
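The ``UnicodeEncodeError`` trick used here also works in plain Python to locate the first offending character (``ascii_error_detail`` is a hypothetical name for illustration):

```python
def ascii_error_detail(value: str) -> str:
    # Encode to ASCII; on failure, the exception carries the exact
    # offsets of the offending characters (the approach used above).
    try:
        value.encode("ascii")
    except UnicodeEncodeError as e:
        return f"found {value[e.start:e.end]!r} at index {e.start}"
    return ""
```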
+
cdef class SequenceRecord:
"""
@@ -57,38 +75,38 @@ cdef class SequenceRecord:
If quality values are available, this is a string
that contains the Phred-scaled qualities encoded as
ASCII(qual+33) (as in FASTQ).
+
+ Raises:
+ ValueError: One of the provided attributes is not ASCII or
+ the lengths of sequence and qualities differ
"""
cdef:
object _name
object _sequence
object _qualities
- def __cinit__(self, object name, object sequence, object qualities=None):
- """Set qualities to None if there are no quality values"""
- self._name = name
- self._sequence = sequence
- self._qualities = qualities
-
def __init__(self, object name, object sequence, object qualities = None):
- # __cinit__ is called first and sets all the variables.
if not PyUnicode_CheckExact(name):
raise TypeError(f"name should be of type str, got {type(name)}")
if not PyUnicode_IS_COMPACT_ASCII(name):
- raise ValueError("name must be a valid ASCII-string.")
+ raise ValueError(is_not_ascii_message("name", name))
if not PyUnicode_CheckExact(sequence):
raise TypeError(f"sequence should be of type str, got {type(sequence)}")
if not PyUnicode_IS_COMPACT_ASCII(sequence):
- raise ValueError("sequence must be a valid ASCII-string.")
+ raise ValueError(is_not_ascii_message("sequence", sequence))
if qualities is not None:
if not PyUnicode_CheckExact(qualities):
raise TypeError(f"qualities should be of type str, got {type(qualities)}")
if not PyUnicode_IS_COMPACT_ASCII(qualities):
- raise ValueError("qualities must be a valid ASCII-string.")
+ raise ValueError(is_not_ascii_message("qualities", qualities))
if len(qualities) != len(sequence):
rname = shorten(name)
raise ValueError("In read named {!r}: length of quality sequence "
"({}) and length of read ({}) do not match".format(
rname, len(qualities), len(sequence)))
+ self._name = name
+ self._sequence = sequence
+ self._qualities = qualities
@property
def name(self):
@@ -99,7 +117,7 @@ cdef class SequenceRecord:
if not PyUnicode_CheckExact(name):
raise TypeError(f"name must be of type str, got {type(name)}")
if not PyUnicode_IS_COMPACT_ASCII(name):
- raise ValueError("name must be a valid ASCII-string.")
+ raise ValueError(is_not_ascii_message("name", name))
self._name = name
@property
@@ -111,7 +129,7 @@ cdef class SequenceRecord:
if not PyUnicode_CheckExact(sequence):
raise TypeError(f"sequence must be of type str, got {type(sequence)}")
if not PyUnicode_IS_COMPACT_ASCII(sequence):
- raise ValueError("sequence must be a valid ASCII-string.")
+ raise ValueError(is_not_ascii_message("sequence", sequence))
self._sequence = sequence
@property
@@ -122,7 +140,7 @@ cdef class SequenceRecord:
def qualities(self, qualities):
if PyUnicode_CheckExact(qualities):
if not PyUnicode_IS_COMPACT_ASCII(qualities):
- raise ValueError("qualities must be a valid ASCII-string.")
+ raise ValueError(is_not_ascii_message("qualities", qualities))
elif qualities is None:
pass
else:
@@ -267,6 +285,13 @@ cdef class SequenceRecord:
header2_length, id1_ends_with_number)
def reverse_complement(self):
+ """
+ Return a reverse-complemented version of this record.
+
+ - The name remains unchanged.
+ - The sequence is reverse complemented.
+ - If quality values exist, their order is reversed.
+ """
cdef:
Py_ssize_t sequence_length = PyUnicode_GET_LENGTH(self._sequence)
object reversed_sequence_obj = PyUnicode_New(sequence_length, 127)
@@ -277,6 +302,7 @@ cdef class SequenceRecord:
char *qualities
Py_ssize_t cursor, reverse_cursor
unsigned char nucleotide
+ SequenceRecord seq_record
reverse_cursor = sequence_length
for cursor in range(sequence_length):
reverse_cursor -= 1
@@ -293,17 +319,21 @@ cdef class SequenceRecord:
reversed_qualities[reverse_cursor] = qualities[cursor]
else:
reversed_qualities_obj = None
- return SequenceRecord.__new__(
- SequenceRecord, self._name, reversed_sequence_obj, reversed_qualities_obj)
+ seq_record = SequenceRecord.__new__(SequenceRecord)
+ seq_record._name = self._name
+ seq_record._sequence = reversed_sequence_obj
+ seq_record._qualities = reversed_qualities_obj
+ return seq_record
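A plain-Python sketch of the same semantics (handles only upper/lowercase ACGT; the Cython implementation above works at the C level with its own complement table):

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence: str, qualities: str = None):
    # The sequence is complemented and reversed; qualities, if present,
    # are only reversed, matching the docstring above.
    rc = sequence.translate(_COMPLEMENT)[::-1]
    rq = qualities[::-1] if qualities is not None else None
    return rc, rq
```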
def paired_fastq_heads(buf1, buf2, Py_ssize_t end1, Py_ssize_t end2):
"""
Skip forward in the two buffers by multiples of four lines.
- Return a tuple (length1, length2) such that buf1[:length1] and
- buf2[:length2] contain the same number of lines (where the
- line number is divisible by four).
+ Returns:
+ A tuple (length1, length2) such that buf1[:length1] and
+ buf2[:length2] contain the same number of lines (where the
+ line number is divisible by four).
"""
# Acquire buffers. Cython automatically checks for errors here.
cdef Py_buffer data1_buffer
@@ -349,15 +379,21 @@ cdef class FastqIter:
"""
Parse a FASTQ file and yield SequenceRecord objects
- The *first value* that the generator yields is a boolean indicating whether
- the first record in the FASTQ has a repeated header (in the third row
- after the ``+``).
+ Arguments:
+ file: a file-like object, opened in binary mode (it must have a readinto
+ method)
- file -- a file-like object, opened in binary mode (it must have a readinto
- method)
+ sequence_class: A custom class to use for the returned instances
+ (instead of SequenceRecord)
- buffer_size -- size of the initial buffer. This is automatically grown
- if a FASTQ record is encountered that does not fit.
+ buffer_size: size of the initial buffer. This is automatically grown
+ if a FASTQ record is encountered that does not fit.
+
+ Yields:
+ The *first value* that the generator yields is a boolean indicating whether
+ the first record in the FASTQ has a repeated header (in the third row
+ after the ``+``). Subsequent values are SequenceRecord objects (or whichever
+ objects sequence_class returned if specified)
"""
cdef:
Py_ssize_t buffer_size
@@ -461,6 +497,7 @@ cdef class FastqIter:
def __next__(self):
cdef:
object ret_val
+ SequenceRecord seq_record
char *name_start
char *name_end
char *sequence_start
@@ -470,6 +507,7 @@ cdef class FastqIter:
char *qualities_start
char *qualities_end
char *buffer_end
+ size_t remaining_bytes
Py_ssize_t name_length, sequence_length, second_header_length, qualities_length
# Repeatedly attempt to parse the buffer until we have found a full record.
# If an attempt fails, we read more data before retrying.
@@ -495,10 +533,15 @@ cdef class FastqIter:
self._read_into_buffer()
continue
second_header_start = sequence_end + 1
- second_header_end = <char *>memchr(second_header_start, b'\n', <size_t>(buffer_end - second_header_start))
- if second_header_end == NULL:
- self._read_into_buffer()
- continue
+ remaining_bytes = (buffer_end - second_header_start)
+ # Usually there is no second header, so we skip the memchr call.
+ if remaining_bytes > 2 and second_header_start[0] == b'+' and second_header_start[1] == b'\n':
+ second_header_end = second_header_start + 1
+ else:
+ second_header_end = <char *>memchr(second_header_start, b'\n', <size_t>(remaining_bytes))
+ if second_header_end == NULL:
+ self._read_into_buffer()
+ continue
qualities_start = second_header_end + 1
qualities_end = <char *>memchr(qualities_start, b'\n', <size_t>(buffer_end - qualities_start))
if qualities_end == NULL:
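The fast path added in this hunk avoids a ``memchr`` scan when the separator line is the common bare ``+``. The same idea in plain Python, with ``bytes.find`` standing in for ``memchr`` (hypothetical helper, bounds checks omitted):

```python
def second_header_end(buf: bytes, start: int) -> int:
    # Fast path: the second header line is almost always just "+\n",
    # so test those two bytes directly before falling back to a scan.
    if buf[start:start + 2] == b"+\n":
        return start + 1
    return buf.find(b"\n", start)
```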
@@ -564,8 +607,11 @@ cdef class FastqIter:
if self.use_custom_class:
ret_val = self.sequence_class(name, sequence, qualities)
else:
- ret_val = SequenceRecord.__new__(SequenceRecord, name, sequence, qualities)
-
+ seq_record = SequenceRecord.__new__(SequenceRecord)
+ seq_record._name = name
+ seq_record._sequence = sequence
+ seq_record._qualities = qualities
+ ret_val = seq_record
# Advance record to next position
self.number_of_records += 1
self.record_start = qualities_end + 1
=====================================
src/dnaio/_util.py
=====================================
@@ -1,24 +1,3 @@
-import pathlib
-
-
-def _is_path(obj: object) -> bool:
- """
- Return whether the given object looks like a path (str, pathlib.Path or pathlib2.Path)
- """
- # TODO
- # pytest uses pathlib2.Path objects on Python 3.5 for its tmp_path fixture.
- # On Python 3.6+, this function can be replaced with isinstance(obj, os.PathLike)
- import sys
-
- if "pathlib2" in sys.modules:
- import pathlib2 # type: ignore
-
- path_classes = [str, pathlib.Path, pathlib2.Path]
- else:
- path_classes = [str, pathlib.Path]
- return isinstance(obj, tuple(path_classes))
-
-
def shorten(s: str, n: int = 100) -> str:
"""Shorten string s to at most n characters, appending "..." if necessary."""
=====================================
src/dnaio/interfaces.py
=====================================
@@ -1,24 +1,42 @@
from abc import ABC, abstractmethod
-from typing import Iterator, Tuple
+from typing import Iterable, Iterator, Tuple
from dnaio import SequenceRecord
class SingleEndReader(ABC):
+ delivers_qualities: bool
+ number_of_records: int
+
@abstractmethod
def __iter__(self) -> Iterator[SequenceRecord]:
- """Yield the records in the input as `SequenceRecord` objects."""
+ """
+ Iterate over an input containing sequence records
+
+ Yields:
+ `SequenceRecord` objects
+
+ Raises:
+ `FileFormatError`
+ if there was a parse error
+ """
class PairedEndReader(ABC):
@abstractmethod
def __iter__(self) -> Iterator[Tuple[SequenceRecord, SequenceRecord]]:
"""
- Yield the records in the paired-end input as pairs of `SequenceRecord` objects.
+ Iterate over an input containing paired-end records
- Raises a `FileFormatError` if reads are improperly paired, that is,
- if there are more reads in one file than the other or if the record IDs
- do not match (according to `SequenceRecord.is_mate`).
+ Yields:
+ Pairs of `SequenceRecord` objects
+
+ Raises:
+ `FileFormatError`
+ if there was a parse error or if reads are improperly paired,
+ that is, if there are more reads in one file than the other or
+ if the record IDs do not match (according to
+ `SequenceRecord.is_mate`).
"""
@@ -38,5 +56,30 @@ class PairedEndWriter(ABC):
because this was already done at parsing time. If it is possible
that the record IDs no longer match, check that
``record1.is_mate(record2)`` returns True before calling
- this function.
+ this method.
+ """
+
+
+class MultipleFileWriter(ABC):
+ _number_of_files: int
+
+ @abstractmethod
+ def write(self, *records: SequenceRecord) -> None:
+ """
+ Write N SequenceRecords to the output. N must be equal
+ to the number of files the MultipleFileWriter was initialized with.
+
+ This method does not check whether the records are properly paired.
+ """
+
+ @abstractmethod
+ def write_iterable(self, list_of_records: Iterable[Tuple[SequenceRecord, ...]]):
+ """
+ Iterate over the list (or other iterable container) and write all
+ N-tuples of SequenceRecord to disk. N must be equal
+ to the number of files the MultipleFileWriter was initialized with.
+
+ This method does not check whether the records are properly paired.
+ This method may provide a speed boost over calling write for each
+ tuple of SequenceRecords individually.
"""
=====================================
src/dnaio/multipleend.py
=====================================
@@ -0,0 +1,258 @@
+import contextlib
+import os
+from os import PathLike
+from typing import BinaryIO, IO, Iterable, Iterator, List, Optional, Tuple, Union
+
+from xopen import xopen
+
+from ._core import SequenceRecord, records_are_mates
+from .exceptions import FileFormatError
+from .interfaces import MultipleFileWriter
+from .readers import FastaReader, FastqReader
+from .singleend import _open_single, _detect_format_from_name
+from .writers import FastaWriter, FastqWriter
+
+
+def _open_multiple(
+ *files: Union[str, PathLike, BinaryIO],
+ fileformat: Optional[str] = None,
+ mode: str = "r",
+ qualities: Optional[bool] = None,
+ opener=xopen,
+):
+ if not files:
+ raise ValueError("At least one file is required")
+ if mode not in ("r", "w", "a"):
+ raise ValueError("Mode must be one of 'r', 'w', 'a'")
+ elif mode == "r":
+ return MultipleFileReader(*files, fileformat=fileformat, opener=opener)
+ elif mode == "w" and fileformat is None:
+ # Assume mixed files will not be offered.
+ for file in files:
+ if isinstance(file, (str, os.PathLike)):
+ fileformat = _detect_format_from_name(os.fspath(file))
+ append = mode == "a"
+ if fileformat == "fastq" or qualities or (fileformat is None and qualities is None):
+ return MultipleFastqWriter(*files, opener=opener, append=append)
+ return MultipleFastaWriter(*files, opener=opener, append=append)
+
+
+class MultipleFileReader:
+ """
+ Read multiple FASTA/FASTQ files simultaneously. Useful when additional
+ FASTQ files with extra information are supplied (UMIs, indices, etc.).
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments.
+ """
+
+ def __init__(
+ self,
+ *files: Union[str, PathLike, BinaryIO],
+ fileformat: Optional[str] = None,
+ opener=xopen,
+ ):
+ if len(files) < 1:
+ raise ValueError("At least one file is required")
+ self._files = files
+ self._stack = contextlib.ExitStack()
+ self._readers: List[Union[FastaReader, FastqReader]] = [
+ self._stack.enter_context(
+ _open_single( # type: ignore
+ file, opener=opener, fileformat=fileformat, mode="r"
+ )
+ )
+ for file in self._files
+ ]
+ self.delivers_qualities: bool = self._readers[0].delivers_qualities
+
+ def __repr__(self) -> str:
+ return (
+ f"{self.__class__.__name__}"
+ f"({', '.join(repr(reader) for reader in self._readers)})"
+ )
+
+ def __iter__(self) -> Iterator[Tuple[SequenceRecord, ...]]:
+ """
+ Iterate over multiple inputs containing records
+
+ Yields:
+ N-tuples of `SequenceRecord` objects where N is equal to the number
+ of files.
+
+ Raises:
+ `FileFormatError`
+ if there was a parse error or if reads are improperly paired,
+ that is, if there are more reads in one file than the others or
+ if the record IDs do not match (according to
+ `records_are_mates`).
+ """
+ if len(self._files) == 1:
+ yield from zip(self._readers[0])
+ else:
+ for records in zip(*self._readers):
+ if not records_are_mates(*records):
+ raise FileFormatError(
+ f"Records are out of sync, names "
+ f"{', '.join(repr(r.name) for r in records)} do not match.",
+ line=None,
+ )
+ yield records
+ # Consume one iteration to check if all the files have an equal number
+ # of records.
+ for reader in self._readers:
+ try:
+ _ = next(iter(reader))
+ except StopIteration:
+ pass
+ record_numbers = [r.number_of_records for r in self._readers]
+ if len(set(record_numbers)) != 1:
+ raise FileFormatError(
+ f"Files: {', '.join(str(file) for file in self._files)} have "
+ f"an unequal amount of reads.",
+ line=None,
+ )
+
+ def close(self):
+ self._stack.close()
+
+ def __enter__(self):
+ return self
+
+ def __exit__(self, *exc):
+ self.close()
+
+
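The iteration and length-check logic above (zip the readers, verify each tuple of records is mated, then advance each reader once more to detect files of unequal length) can be sketched with plain iterators. `iter_in_sync` and the record names below are illustrative stand-ins, not part of the dnaio API:

```python
def iter_in_sync(*streams):
    """Yield matched tuples from several record streams, as MultipleFileReader does.

    Raises ValueError if record names disagree or stream lengths differ.
    """
    counts = [0] * len(streams)

    def counted(index, stream):
        # Mirror of each reader's number_of_records counter.
        for item in stream:
            counts[index] += 1
            yield item

    iterators = [counted(i, s) for i, s in enumerate(streams)]
    for names in zip(*iterators):
        # Mate check: all records in one tuple must share the same name.
        if len(set(names)) != 1:
            raise ValueError(f"Records out of sync: {names!r}")
        yield names
    # zip() stops at the shortest stream; advancing each stream once more
    # bumps the counter of any stream that still has records left.
    for iterator in iterators:
        next(iterator, None)
    if len(set(counts)) != 1:
        raise ValueError("Files have an unequal number of records")
```

The counter-based length check mirrors the real code's use of `number_of_records`: comparing counts after one extra advance avoids miscounting records that `zip` consumed before hitting a shorter stream.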
+class MultipleFastaWriter(MultipleFileWriter):
+ """
+ Write multiple FASTA files simultaneously.
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments.
+ """
+
+ def __init__(
+ self,
+ *files: Union[str, PathLike, BinaryIO],
+ opener=xopen,
+ append: bool = False,
+ ):
+ if len(files) < 1:
+ raise ValueError("At least one file is required")
+ mode = "a" if append else "w"
+ self._files = files
+ self._number_of_files = len(files)
+ self._stack = contextlib.ExitStack()
+ self._writers: List[Union[FastaWriter, FastqWriter]] = [
+ self._stack.enter_context(
+ _open_single( # type: ignore
+ file,
+ opener=opener,
+ fileformat="fasta",
+ mode=mode,
+ qualities=False,
+ )
+ )
+ for file in self._files
+ ]
+
+ def __repr__(self) -> str:
+ return (
+ f"{self.__class__.__name__}"
+ f"({', '.join(repr(writer) for writer in self._writers)})"
+ )
+
+ def close(self):
+ self._stack.close()
+
+ def write(self, *records: SequenceRecord):
+ if len(records) != self._number_of_files:
+ raise ValueError(f"records must have length {self._number_of_files}")
+ for record, writer in zip(records, self._writers):
+ writer.write(record)
+
+ def write_iterable(self, records_iterable: Iterable[Tuple[SequenceRecord, ...]]):
+ for records in records_iterable:
+ self.write(*records)
+
+ def __enter__(self):
+ return self
+
+ def __exit__(self, *exc):
+ self.close()
+
+
+class MultipleFastqWriter(MultipleFileWriter):
+ """
+ Write multiple FASTQ files simultaneously.
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments.
+ """
+
+ def __init__(
+ self,
+ *files: Union[str, PathLike, BinaryIO],
+ opener=xopen,
+ append: bool = False,
+ ):
+ if len(files) < 1:
+ raise ValueError("At least one file is required")
+ mode = "a" if append else "w"
+ self._files = files
+ self._number_of_files = len(files)
+ self._stack = contextlib.ExitStack()
+ self._writers: List[IO] = [
+ self._stack.enter_context(
+ opener(file, mode + "b") if not hasattr(file, "write") else file
+ )
+ for file in self._files
+ ]
+
+ def __repr__(self) -> str:
+ return (
+ f"{self.__class__.__name__}" f"({', '.join(str(f) for f in self._files)})"
+ )
+
+ def close(self):
+ self._stack.close()
+
+ def write(self, *records: SequenceRecord):
+ if len(records) != self._number_of_files:
+ raise ValueError(f"records must have length {self._number_of_files}")
+ for record, writer in zip(records, self._writers):
+ writer.write(record.fastq_bytes())
+
+ def write_iterable(self, records_iterable: Iterable[Tuple[SequenceRecord, ...]]):
+ # Use faster methods for more common cases before falling back to
+ # generic multiple files mode (which is much slower due to calling the
+ # zip function).
+ if self._number_of_files == 1:
+ output = self._writers[0]
+ for (record,) in records_iterable:
+ output.write(record.fastq_bytes())
+ elif self._number_of_files == 2:
+ output1 = self._writers[0]
+ output2 = self._writers[1]
+ for record1, record2 in records_iterable:
+ output1.write(record1.fastq_bytes())
+ output2.write(record2.fastq_bytes())
+ elif self._number_of_files == 3:
+ output1 = self._writers[0]
+ output2 = self._writers[1]
+ output3 = self._writers[2]
+ for record1, record2, record3 in records_iterable:
+ output1.write(record1.fastq_bytes())
+ output2.write(record2.fastq_bytes())
+ output3.write(record3.fastq_bytes())
+ else: # More than 3 files is quite uncommon.
+ writers = self._writers
+ for records in records_iterable:
+ for record, output in zip(records, writers):
+ output.write(record.fastq_bytes())
+
+ def __enter__(self):
+ return self
+
+ def __exit__(self, *exc):
+ self.close()
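The unrolled loops in `write_iterable` above trade a little code duplication for speed in the common one-, two-, and three-file cases, falling back to `zip` only for more streams. A minimal sketch of that dispatch pattern, writing raw bytes to in-memory sinks (the function name and test data are illustrative):

```python
import io

def write_record_sets(writers, records_iterable):
    """Write one record per output stream, mirroring the unrolled dispatch."""
    if len(writers) == 2:
        # Common paired-end case: an unrolled loop avoids per-tuple zip() overhead.
        out1, out2 = writers
        for record1, record2 in records_iterable:
            out1.write(record1)
            out2.write(record2)
    else:
        # Generic fall-back for any number of streams.
        for records in records_iterable:
            for record, out in zip(records, writers):
                out.write(record)

sinks = [io.BytesIO(), io.BytesIO()]
write_record_sets(sinks, [(b"@r/1\nACG\n+\nHHH\n", b"@r/2\nTGC\n+\nHHH\n")])
```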
=====================================
src/dnaio/pairedend.py
=====================================
@@ -13,11 +13,8 @@ from .singleend import _open_single
def _open_paired(
- file1: Union[str, PathLike, BinaryIO],
- *,
- file2: Optional[Union[str, PathLike, BinaryIO]] = None,
+ *files: Union[str, PathLike, BinaryIO],
fileformat: Optional[str] = None,
- interleaved: bool = False,
mode: str = "r",
qualities: Optional[bool] = None,
opener=xopen,
@@ -25,43 +22,43 @@ def _open_paired(
"""
Open paired-end reads
"""
- if interleaved and file2 is not None:
- raise ValueError("When interleaved is True, file2 must be None")
- if file2 is not None:
- if mode in "wa" and file1 == file2:
+ if len(files) == 2:
+ if mode in "wa" and files[0] == files[1]:
raise ValueError("The paired-end output files are identical")
if "r" in mode:
return TwoFilePairedEndReader(
- file1, file2, fileformat=fileformat, opener=opener, mode=mode
+ *files, fileformat=fileformat, opener=opener, mode=mode
)
append = mode == "a"
return TwoFilePairedEndWriter(
- file1,
- file2,
+ *files,
fileformat=fileformat,
qualities=qualities,
opener=opener,
append=append,
)
- if interleaved:
+ elif len(files) == 1:
if "r" in mode:
return InterleavedPairedEndReader(
- file1, fileformat=fileformat, opener=opener, mode=mode
+ files[0], fileformat=fileformat, opener=opener, mode=mode
)
append = mode == "a"
return InterleavedPairedEndWriter(
- file1,
+ files[0],
fileformat=fileformat,
qualities=qualities,
opener=opener,
append=append,
)
- assert False
+ raise ValueError("_open_paired must be called with one or two files.")
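The refactored `_open_paired` dispatches purely on the number of positional files: two files mean a two-file paired-end reader/writer, one file means interleaved. A sketch of that dispatch, including the identical-output-file guard (function name and return values are illustrative, not dnaio API):

```python
def classify_paired(*files, mode="r"):
    """Sketch of the dispatch: two files -> two-file paired, one -> interleaved."""
    if len(files) == 2:
        if mode in "wa" and files[0] == files[1]:
            raise ValueError("The paired-end output files are identical")
        return "two-file"
    if len(files) == 1:
        return "interleaved"
    raise ValueError("must be called with one or two files")
```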
class TwoFilePairedEndReader(PairedEndReader):
"""
- Read paired-end reads from two files.
+ Read paired-end reads from two files (not interleaved)
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments.
"""
paired = True
@@ -92,7 +89,7 @@ class TwoFilePairedEndReader(PairedEndReader):
def __iter__(self) -> Iterator[Tuple[SequenceRecord, SequenceRecord]]:
"""
Iterate over the paired reads.
- Each yielded item is a pair of SequenceRecord objects.
+ Each yielded item is a pair of `SequenceRecord` objects.
Raises a `FileFormatError` if reads are improperly paired.
"""
@@ -139,7 +136,10 @@ class TwoFilePairedEndReader(PairedEndReader):
class InterleavedPairedEndReader(PairedEndReader):
"""
- Read paired-end reads from an interleaved FASTQ file.
+ Read paired-end reads from an interleaved FASTQ file
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments.
"""
paired = True
@@ -191,6 +191,13 @@ class InterleavedPairedEndReader(PairedEndReader):
class TwoFilePairedEndWriter(PairedEndWriter):
+ """
+ Write paired-end reads to two files (not interleaved)
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments.
+ """
+
def __init__(
self,
file1: Union[str, PathLike, BinaryIO],
@@ -246,6 +253,9 @@ class TwoFilePairedEndWriter(PairedEndWriter):
class InterleavedPairedEndWriter(PairedEndWriter):
"""
Write paired-end reads to an interleaved FASTA or FASTQ file
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments.
"""
def __init__(
=====================================
src/dnaio/readers.py
=====================================
@@ -64,6 +64,9 @@ class BinaryFileReader:
class FastaReader(BinaryFileReader, SingleEndReader):
"""
Reader for FASTA files
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments.
"""
def __init__(
@@ -104,7 +107,14 @@ class FastaReader(BinaryFileReader, SingleEndReader):
if line and line[0] == ">":
if name is not None:
self.number_of_records += 1
- yield self.sequence_class(name, self._delimiter.join(seq), None)
+ try:
+ yield self.sequence_class(name, self._delimiter.join(seq), None)
+ except ValueError as e:
+ raise FastaFormatError(
+ str(e)
+ + " (line number refers to record after the problematic one)",
+ line=i,
+ )
name = line[1:]
seq = []
elif line and line[0] == "#":
@@ -119,12 +129,18 @@ class FastaReader(BinaryFileReader, SingleEndReader):
if name is not None:
self.number_of_records += 1
- yield self.sequence_class(name, self._delimiter.join(seq), None)
+ try:
+ yield self.sequence_class(name, self._delimiter.join(seq), None)
+ except ValueError as e:
+ raise FastaFormatError(str(e), line=None)
class FastqReader(BinaryFileReader, SingleEndReader):
"""
Reader for FASTQ files. Does not support multi-line FASTQ files.
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments.
"""
def __init__(
=====================================
src/dnaio/singleend.py
=====================================
@@ -1,10 +1,9 @@
import os
-from typing import Optional, Union, BinaryIO
+from typing import Optional, Union, BinaryIO, Tuple
from .exceptions import UnknownFileFormat
from .readers import FastaReader, FastqReader
from .writers import FastaWriter, FastqWriter
-from ._util import _is_path
def _open_single(
@@ -21,22 +20,8 @@ def _open_single(
if mode not in ("r", "w", "a"):
raise ValueError("Mode must be 'r', 'w' or 'a'")
- path: Optional[str]
- if _is_path(file_or_path):
- path = os.fspath(file_or_path) # type: ignore
- file = opener(path, mode[0] + "b")
- close_file = True
- else:
- if "r" in mode and not hasattr(file_or_path, "readinto"):
- raise ValueError(
- "When passing in an open file-like object, it must have been opened in binary mode"
- )
- file = file_or_path
- if hasattr(file, "name") and isinstance(file.name, str):
- path = file.name
- else:
- path = None
- close_file = False
+ close_file, file, path = _open_file_or_path(file_or_path, mode, opener)
+ del file_or_path
if path is not None and fileformat is None:
fileformat = _detect_format_from_name(path)
@@ -75,11 +60,38 @@ def _open_single(
return FastqReader(file, _close_file=close_file)
return FastqWriter(file, _close_file=close_file)
+ if close_file:
+ file.close()
raise UnknownFileFormat(
f"File format '{fileformat}' is unknown (expected 'fasta' or 'fastq')."
)
+def _open_file_or_path(
+ file_or_path: Union[str, os.PathLike, BinaryIO], mode: str, opener
+) -> Tuple[bool, BinaryIO, Optional[str]]:
+ path: Optional[str]
+ file: BinaryIO
+ try:
+ path = os.fspath(file_or_path) # type: ignore
+ except TypeError:
+ if "r" in mode and not hasattr(file_or_path, "readinto"):
+ raise ValueError(
+ "When passing in an open file-like object, it must have been opened in binary mode"
+ )
+ file = file_or_path # type: ignore
+ if hasattr(file, "name") and isinstance(file.name, str):
+ path = file.name
+ else:
+ path = None
+ close_file = False
+ else:
+ file = opener(path, mode[0] + "b")
+ close_file = True
+
+ return close_file, file, path
+
+
def _detect_format_from_name(name: str) -> Optional[str]:
"""
name -- file name
@@ -87,7 +99,7 @@ def _detect_format_from_name(name: str) -> Optional[str]:
Return 'fasta', 'fastq' or None if the format could not be detected.
"""
name = name.lower()
- for ext in (".gz", ".xz", ".bz2"):
+ for ext in (".gz", ".xz", ".bz2", ".zst"):
if name.endswith(ext):
name = name[: -len(ext)]
break
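With `.zst` added above, zstandard-compressed files are now recognised too: one compression suffix is stripped, then the remaining extension decides the format. A sketch of this detection (the extension table here is abbreviated; dnaio's actual table recognises additional FASTA/FASTQ extensions):

```python
def detect_format_from_name(name):
    """Strip one compression suffix, then inspect the remaining extension."""
    name = name.lower()
    for ext in (".gz", ".xz", ".bz2", ".zst"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    if name.endswith((".fasta", ".fa", ".fna")):
        return "fasta"
    if name.endswith((".fastq", ".fq")):
        return "fastq"
    return None
```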
=====================================
src/dnaio/writers.py
=====================================
@@ -1,10 +1,10 @@
+import os
from os import PathLike
from typing import Union, BinaryIO, Optional
from xopen import xopen
from . import SequenceRecord
-from ._util import _is_path
from .interfaces import SingleEndWriter
@@ -13,6 +13,8 @@ class FileWriter:
A mix-in that manages opening and closing and provides a context manager
"""
+ _file: BinaryIO
+
def __init__(
self,
file: Union[PathLike, str, BinaryIO],
@@ -20,12 +22,15 @@ class FileWriter:
opener=xopen,
_close_file: Optional[bool] = None,
):
- if _is_path(file):
+ try:
+ os.fspath(file) # type: ignore
+ except TypeError:
+ # Assume it’s an open file-like object
+ self._file = file # type: ignore
+ self._close_on_exit = bool(_close_file)
+ else:
self._file = opener(file, "wb")
self._close_on_exit = True
- else:
- self._file = file
- self._close_on_exit = bool(_close_file)
def __repr__(self) -> str:
return f"{self.__class__.__name__}('{getattr(self._file, 'name', self._file)}')"
@@ -45,7 +50,14 @@ class FileWriter:
class FastaWriter(FileWriter, SingleEndWriter):
"""
- Write FASTA-formatted sequences to a file.
+ Write FASTA-formatted sequences to a file
+
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments unless you need to set the
+ line_length argument.
+
+ Arguments:
+ line_length: Wrap sequence lines after this many characters (None disables wrapping)
"""
def __init__(
@@ -56,13 +68,6 @@ class FastaWriter(FileWriter, SingleEndWriter):
opener=xopen,
_close_file: Optional[bool] = None,
):
- """
-
- Arguments:
- file: A path or an open file-like object
- line_length: Wrap sequence lines after this many characters (None disables wrapping)
- opener: If *file* is a path, this function is called to open it.
- """
super().__init__(file, opener=opener, _close_file=_close_file)
self.line_length = line_length if line_length != 0 else None
@@ -101,14 +106,15 @@ class FastaWriter(FileWriter, SingleEndWriter):
class FastqWriter(FileWriter, SingleEndWriter):
"""
- Write records in FASTQ format.
+ Write records in FASTQ format
- FASTQ files are formatted like this::
+ While this class can be instantiated directly, the recommended way is to
+ use `dnaio.open` with appropriate arguments unless you need to set
+ two_headers to True.
- @read name
- AACCGGTT
- +
- FF,:F,,F
+ Arguments:
+ two_headers: If True, the header is repeated on the third line
+ of each record after the "+".
"""
file_mode = "wb"
@@ -121,13 +127,6 @@ class FastqWriter(FileWriter, SingleEndWriter):
opener=xopen,
_close_file: Optional[bool] = None,
):
- """
- Arguments:
- file: A path or an open file-like object
- two_headers: If True, the header is repeated on the third line
- of each record after the "+".
- opener: If *file* is a path, this function is called to open it.
- """
super().__init__(file, opener=opener, _close_file=_close_file)
self._two_headers = two_headers
# setattr avoids a complaint from Mypy
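The `line_length` and `two_headers` arguments now documented in the writer docstrings control output formatting only. A sketch of what each produces (stand-alone helper functions for illustration, not the dnaio writers themselves):

```python
def format_fasta(name, sequence, line_length=None):
    """Render one FASTA record; line_length wraps the sequence (None disables)."""
    if line_length:
        chunks = [sequence[i:i + line_length]
                  for i in range(0, len(sequence), line_length)]
    else:
        chunks = [sequence]
    return ">" + name + "\n" + "\n".join(chunks) + "\n"

def format_fastq(name, sequence, qualities, two_headers=False):
    """Render one FASTQ record; two_headers repeats the name after the '+'."""
    plus = "+" + (name if two_headers else "")
    return f"@{name}\n{sequence}\n{plus}\n{qualities}\n"
```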
=====================================
tests/test_chunks.py
=====================================
@@ -1,6 +1,7 @@
from pytest import raises
from io import BytesIO
+from dnaio import UnknownFileFormat
from dnaio._core import paired_fastq_heads
from dnaio.chunks import _fastq_head, _fasta_head, read_chunks, read_paired_chunks
@@ -84,3 +85,8 @@ def test_read_chunks():
def test_read_chunks_empty():
assert list(read_chunks(BytesIO(b""))) == []
+
+
+def test_invalid_file_format():
+ with raises(UnknownFileFormat):
+ list(read_chunks(BytesIO(b"invalid format")))
=====================================
tests/test_internal.py
=====================================
@@ -22,8 +22,10 @@ from dnaio import (
FastqWriter,
InterleavedPairedEndWriter,
TwoFilePairedEndReader,
+ records_are_mates,
+ record_names_match,
+ SequenceRecord,
)
-from dnaio import records_are_mates, record_names_match, SequenceRecord
from dnaio.writers import FileWriter
from dnaio.readers import BinaryFileReader
@@ -264,10 +266,10 @@ class TestFastqReader:
class TestOpen:
- def setup(self):
+ def setup_method(self):
self._tmpdir = mkdtemp()
- def teardown(self):
+ def teardown_method(self):
shutil.rmtree(self._tmpdir)
def test_sequence_reader(self):
@@ -339,7 +341,7 @@ class TestOpen:
path = os.path.join(self._tmpdir, "tmp.fastq")
with raises(ValueError):
with dnaio.open(path, mode="w", qualities=False):
- pass
+ pass # pragma: no cover
class TestInterleavedReader:
@@ -386,11 +388,11 @@ class TestInterleavedReader:
class TestFastaWriter:
- def setup(self):
+ def setup_method(self):
self._tmpdir = mkdtemp()
self.path = os.path.join(self._tmpdir, "tmp.fasta")
- def teardown(self):
+ def teardown_method(self):
shutil.rmtree(self._tmpdir)
def test(self):
@@ -436,11 +438,11 @@ class TestFastaWriter:
class TestFastqWriter:
- def setup(self):
+ def setup_method(self):
self._tmpdir = mkdtemp()
self.path = os.path.join(self._tmpdir, "tmp.fastq")
- def teardown(self):
+ def teardown_method(self):
shutil.rmtree(self._tmpdir)
def test(self):
@@ -608,7 +610,7 @@ def test_file_writer(tmp_path):
assert path.exists()
with raises(ValueError) as e:
with fw:
- pass
+ pass # pragma: no coverage
assert "operation on closed file" in e.value.args[0]
@@ -618,7 +620,7 @@ def test_binary_file_reader():
bfr.close()
with raises(ValueError) as e:
with bfr:
- pass
+ pass # pragma: no coverage
assert "operation on closed" in e.value.args[0]
=====================================
tests/test_multiple.py
=====================================
@@ -0,0 +1,121 @@
+import io
+import itertools
+import os
+from pathlib import Path
+
+import dnaio
+from dnaio import SequenceRecord, _open_multiple
+
+import pytest
+
+
+@pytest.mark.parametrize(
+ ["fileformat", "number_of_files"],
+ itertools.product(("fasta", "fastq"), (1, 2, 3, 4)),
+)
+def test_read_files(fileformat, number_of_files):
+ file = Path(__file__).parent / "data" / ("simple." + fileformat)
+ files = [file for _ in range(number_of_files)]
+ with _open_multiple(*files) as multiple_reader:
+ for records in multiple_reader:
+ pass
+ assert len(records) == number_of_files
+ assert isinstance(records, tuple)
+
+
+@pytest.mark.parametrize(
+ "kwargs",
+ [
+ dict(mode="w", fileformat="fasta"),
+ dict(mode="r"),
+ dict(mode="w", fileformat="fastq"),
+ ],
+)
+def test_open_no_file_error(kwargs):
+ with pytest.raises(ValueError):
+ _open_multiple(**kwargs)
+
+
+def test_open_multiple_unsupported_mode():
+ with pytest.raises(ValueError) as error:
+ _open_multiple(os.devnull, mode="X")
+ error.match("one of 'r', 'w', 'a'")
+
+
+@pytest.mark.parametrize(
+ ["number_of_files", "content"],
+ itertools.product(
+ (1, 2, 3, 4), (">my_fasta\nAGCTAGA\n", "@my_fastq\nAGC\n+\nHHH\n")
+ ),
+)
+def test_multiple_binary_read(number_of_files, content):
+ files = [io.BytesIO(content.encode("ascii")) for _ in range(number_of_files)]
+ with _open_multiple(*files) as reader:
+ for records_tup in reader:
+ pass
+
+
+@pytest.mark.parametrize(
+ ["number_of_files", "fileformat"],
+ itertools.product((1, 2, 3, 4), ("fastq", "fasta")),
+)
+def test_multiple_binary_write(number_of_files, fileformat):
+ files = [io.BytesIO() for _ in range(number_of_files)]
+ records = [SequenceRecord("A", "A", "A") for _ in range(number_of_files)]
+ with _open_multiple(*files, mode="w", fileformat=fileformat) as writer:
+ writer.write(*records)
+
+
+@pytest.mark.parametrize(
+ ["number_of_files", "fileformat"],
+ itertools.product((1, 2, 3, 4), ("fastq", "fasta")),
+)
+def test_multiple_write_too_many(number_of_files, fileformat):
+ files = [io.BytesIO() for _ in range(number_of_files)]
+ records = [SequenceRecord("A", "A", "A") for _ in range(number_of_files + 1)]
+ with _open_multiple(*files, mode="w", fileformat=fileformat) as writer:
+ with pytest.raises(ValueError) as error:
+ writer.write(*records)
+ error.match(str(number_of_files))
+
+
+@pytest.mark.parametrize(
+ ["number_of_files", "fileformat"],
+ itertools.product((1, 2, 3, 4), ("fastq", "fasta")),
+)
+def test_multiple_write_iterable(number_of_files, fileformat):
+ files = [io.BytesIO() for _ in range(number_of_files)]
+ records = [SequenceRecord("A", "A", "A") for _ in range(number_of_files)]
+ records_list = [records, records, records]
+ with _open_multiple(*files, mode="w", fileformat=fileformat) as writer:
+ writer.write_iterable(records_list)
+
+
+@pytest.mark.parametrize("number_of_files", (2, 3, 4))
+def test_multiple_read_unmatched_names(number_of_files):
+ record1_content = b"@my_fastq\nAGC\n+\nHHH\n"
+ record2_content = b"@my_fasterq\nAGC\n+\nHHH\n"
+ files = (
+ io.BytesIO(record1_content),
+ *(io.BytesIO(record2_content) for _ in range(number_of_files - 1)),
+ )
+ with _open_multiple(*files) as reader:
+ with pytest.raises(dnaio.FileFormatError) as error:
+ for records in reader:
+ pass
+ error.match("do not match")
+
+
+@pytest.mark.parametrize("number_of_files", (2, 3, 4))
+def test_multiple_read_out_of_sync(number_of_files):
+ record1_content = b"@my_fastq\nAGC\n+\nHHH\n"
+ record2_content = b"@my_fastq\nAGC\n+\nHHH\n@my_secondfastq\nAGC\n+\nHHH\n"
+ files = (
+ io.BytesIO(record1_content),
+ *(io.BytesIO(record2_content) for _ in range(number_of_files - 1)),
+ )
+ with _open_multiple(*files) as reader:
+ with pytest.raises(dnaio.FileFormatError) as error:
+ for records in reader:
+ pass
+ error.match("unequal amount")
=====================================
tests/test_open.py
=====================================
@@ -1,9 +1,11 @@
+import os
from pathlib import Path
-import dnaio
+import pytest
from xopen import xopen
-import pytest
+import dnaio
+from dnaio import FileFormatError, UnknownFileFormat
@pytest.fixture(params=["", ".gz", ".bz2", ".xz"])
@@ -31,17 +33,12 @@ SIMPLE_RECORDS = {
def formatted_sequence(record, fileformat):
if fileformat == "fastq":
return "@{}\n{}\n+\n{}\n".format(record.name, record.sequence, record.qualities)
- elif fileformat == "fastq_bytes":
- return b"@%b\n%b\n+\n%b\n" % (record.name, record.sequence, record.qualities)
else:
return ">{}\n{}\n".format(record.name, record.sequence)
def formatted_sequences(records, fileformat):
- record_iter = (formatted_sequence(record, fileformat) for record in records)
- if fileformat == "fastq_bytes":
- return b"".join(record_iter)
- return "".join(record_iter)
+ return "".join(formatted_sequence(record, fileformat) for record in records)
def test_formatted_sequence():
@@ -57,7 +54,7 @@ def test_version():
def test_open_nonexistent(tmp_path):
with pytest.raises(FileNotFoundError):
with dnaio.open(tmp_path / "nonexistent"):
- pass
+ pass # pragma: no cover
def test_open_empty_file_with_unrecognized_extension(tmp_path):
@@ -68,6 +65,43 @@ def test_open_empty_file_with_unrecognized_extension(tmp_path):
assert records == []
+def test_fileformat_error(tmp_path):
+ with open(tmp_path / "file.fastq", mode="w") as f:
+ print("this is not a FASTQ file", file=f)
+ with pytest.raises(FileFormatError) as e:
+ with dnaio.open(tmp_path / "file.fastq") as f:
+ _ = list(f) # pragma: no cover
+ assert "at line 2" in str(e.value) # Premature end of file
+
+
+def test_write_unknown_file_format(tmp_path):
+ with pytest.raises(UnknownFileFormat):
+ with dnaio.open(tmp_path / "out.txt", mode="w") as f:
+ f.write(dnaio.SequenceRecord("name", "ACG", "###")) # pragma: no cover
+
+
+def test_read_unknown_file_format(tmp_path):
+ with open(tmp_path / "file.txt", mode="w") as f:
+ print("text file", file=f)
+ with pytest.raises(UnknownFileFormat):
+ with dnaio.open(tmp_path / "file.txt") as f:
+ _ = list(f) # pragma: no cover
+
+
+def test_invalid_format(tmp_path):
+ with pytest.raises(UnknownFileFormat):
+ with dnaio.open(tmp_path / "out.txt", mode="w", fileformat="foo"):
+ pass # pragma: no cover
+
+
+def test_write_qualities_to_file_without_fastq_extension(tmp_path):
+ with dnaio.open(tmp_path / "out.txt", mode="w", qualities=True) as f:
+ f.write(dnaio.SequenceRecord("name", "ACG", "###"))
+
+ with dnaio.open(tmp_path / "out.txt", mode="w", qualities=False) as f:
+ f.write(dnaio.SequenceRecord("name", "ACG", None))
+
+
def test_read(fileformat, extension):
with dnaio.open("tests/data/simple." + fileformat + extension) as f:
records = list(f)
@@ -188,7 +222,7 @@ def test_write_paired_same_path(tmp_path):
path2 = tmp_path / "same.fastq"
with pytest.raises(ValueError):
with dnaio.open(file1=path1, file2=path2, mode="w"):
- pass
+ pass # pragma: no cover
def test_write_paired(tmp_path, fileformat, extension):
@@ -301,3 +335,50 @@ def test_islice_gzip_does_not_fail(tmp_path):
f = dnaio.open(path)
next(iter(f))
f.close()
+
+
+def test_unsupported_mode():
+ with pytest.raises(ValueError) as error:
+ _ = dnaio.open(os.devnull, mode="x")
+ error.match("Mode must be")
+
+
+def test_no_file2_with_multiple_args():
+ with pytest.raises(ValueError) as error:
+ _ = dnaio.open(os.devnull, os.devnull, file2=os.devnull)
+ error.match("as positional argument")
+ error.match("file2")
+
+
+def test_no_multiple_files_interleaved():
+ with pytest.raises(ValueError) as error:
+ _ = dnaio.open(os.devnull, os.devnull, interleaved=True)
+ error.match("interleaved")
+ error.match("one file")
+
+
+ at pytest.mark.parametrize(
+ ["mode", "expected_class"],
+ [("r", dnaio.PairedEndReader), ("w", dnaio.PairedEndWriter)],
+)
+def test_paired_open_with_multiple_args(tmp_path, fileformat, mode, expected_class):
+ path = tmp_path / "file"
+ path2 = tmp_path / "file2"
+ path.touch()
+ path2.touch()
+ with dnaio.open(path, path2, fileformat=fileformat, mode=mode) as f:
+ assert isinstance(f, expected_class)
+
+
+ at pytest.mark.parametrize(
+ ["kwargs", "expected_class"],
+ [
+ ({}, dnaio.multipleend.MultipleFileReader),
+ ({"mode": "w"}, dnaio.multipleend.MultipleFastqWriter),
+ ({"mode": "w", "fileformat": "fastq"}, dnaio.multipleend.MultipleFastqWriter),
+ ({"mode": "w", "fileformat": "fasta"}, dnaio.multipleend.MultipleFastaWriter),
+ ],
+)
+def test_multiple_open_fastq(kwargs, expected_class):
+ with dnaio.open(os.devnull, os.devnull, os.devnull, **kwargs) as f:
+ assert isinstance(f, expected_class)
=====================================
tox.ini
=====================================
@@ -1,5 +1,5 @@
[tox]
-envlist = flake8,black,mypy,docs,py37,py38,py39,py310
+envlist = flake8,black,mypy,docs,py37,py38,py39,py310,py311
isolated_build = True
[testenv]
@@ -7,8 +7,9 @@ deps =
pytest
coverage
commands =
- coverage run --concurrency=multiprocessing -m pytest --doctest-modules --pyargs tests/
+ coverage run -m pytest
coverage combine
+ coverage xml
coverage report
setenv = PYTHONDEVMODE = 1
@@ -46,6 +47,12 @@ source =
src/
*/site-packages/
+[coverage:report]
+precision = 1
+exclude_lines =
+ pragma: no cover
+ def __repr__
+
[flake8]
max-line-length = 99
max-complexity = 15
View it on GitLab: https://salsa.debian.org/med-team/python-dnaio/-/compare/94db0ad1243b81b4373b04aaa9481f56124b1a29...5855df2a71bdc488fc12bbbcb7e915967b4260b4