[Git][debian-gis-team/kerchunk][master] 9 commits: New upstream version 0.2.10
Antonio Valentino (@antonio.valentino)
gitlab at salsa.debian.org
Sat Apr 4 09:57:52 BST 2026
Antonio Valentino pushed to branch master at Debian GIS Project / kerchunk
Commits:
6aa2e938 by Antonio Valentino at 2026-04-03T17:07:03+00:00
New upstream version 0.2.10
- - - - -
9bd6964f by Antonio Valentino at 2026-04-03T17:12:03+00:00
Update upstream source from tag 'upstream/0.2.10'
Update to upstream version '0.2.10'
with Debian dir 875ee3affc657bd3032d42b6485c238f62911395
- - - - -
4f5d577f by Antonio Valentino at 2026-04-03T17:13:06+00:00
New upstream release
- - - - -
f11ddd70 by Antonio Valentino at 2026-04-04T08:41:45+00:00
Refresh all patches
- - - - -
7ef1590e by Antonio Valentino at 2026-04-04T08:41:52+00:00
Add dependency on hdf5plugin
- - - - -
3c6a5fa2 by Antonio Valentino at 2026-04-04T08:41:52+00:00
Add dependency on bitshuffle and hdf5-filter-plugin
- - - - -
b4da93be by Antonio Valentino at 2026-04-04T08:41:52+00:00
Standards version bump
- - - - -
72f9d2f0 by Antonio Valentino at 2026-04-04T08:41:53+00:00
Set distribution to unstable
- - - - -
e5b46fd9 by Antonio Valentino at 2026-04-04T10:57:03+02:00
Merge remote-tracking branch 'origin/master'
- - - - -
24 changed files:
- ci/environment-py311.yml
- ci/environment-py312.yml
- debian/changelog
- debian/control
- debian/patches/0001-Fix-privacy-breaches.patch
- debian/patches/0002-No-internet.patch
- debian/rules
- + docs/source/code-of-conduct.rst
- docs/source/index.rst
- kerchunk/grib2.py
- kerchunk/hdf.py
- kerchunk/utils.py
- tests/gen_hdf5.py
- + tests/gen_mini_hdf5.py
- + tests/grib_idx_fixtures/gec00_20250101_00_soilw/gec00_20250101_00_soilw.grib
- + tests/hdf5_mali_chunk.h5
- + tests/hdf5_mali_chunk2.h5
- + tests/hdf5_mini_blosc.h5
- + tests/hdf5_mini_lz4.h5
- + tests/hdf5_mini_shuffle_blosc.h5
- + tests/hdf5_mini_shuffle_zstd.h5
- + tests/hdf5_mini_zstd.h5
- tests/test_hdf.py
- tests/test_utils.py
Changes:
=====================================
ci/environment-py311.yml
=====================================
@@ -9,6 +9,7 @@ dependencies:
- xarray>=2024.10.0
- h5netcdf
- h5py
+ - hdf5plugin
- pandas
- cfgrib
# Temporary workaround for #508
=====================================
ci/environment-py312.yml
=====================================
@@ -9,6 +9,7 @@ dependencies:
- xarray>=2024.10.0
- h5netcdf
- h5py
+ - hdf5plugin
- pandas
- cfgrib
# Temporary workaround for #508
=====================================
debian/changelog
=====================================
@@ -1,9 +1,16 @@
-kerchunk (0.2.9-5) UNRELEASED; urgency=medium
+kerchunk (0.2.10-1) unstable; urgency=medium
- * Team upload.
- * Bump Standards-Version to 4.7.4, no changes.
+ * New upstream release.
+ * Standards version bumped to 4.7.4, no changes.
+ * debian/patches:
+ - Refresh all patches.
+ * debian/control:
+ - Add dependency on python3-hdf5plugin, bitshuffle and
+ hdf5-filter-plugin.
+ * debian/rules:
+ - Update the list of tests to skip.
- -- Bas Couwenberg <sebastic at debian.org> Sat, 04 Apr 2026 10:12:55 +0200
+ -- Antonio Valentino <antonio.valentino at tiscali.it> Sat, 04 Apr 2026 08:33:04 +0000
kerchunk (0.2.9-4) unstable; urgency=medium
=====================================
debian/control
=====================================
@@ -2,9 +2,11 @@ Source: kerchunk
Section: python
Maintainer: Debian GIS Project <pkg-grass-devel at lists.alioth.debian.org>
Uploaders: Antonio Valentino <antonio.valentino at tiscali.it>
-Build-Depends: debhelper-compat (= 13),
+Build-Depends: bitshuffle,
+ debhelper-compat (= 13),
dh-sequence-python3,
dh-sequence-sphinxdoc <!nodoc>,
+ hdf5-filter-plugin,
pybuild-plugin-pyproject,
python3-aiohttp <!nocheck>,
python3-all,
@@ -16,6 +18,7 @@ Build-Depends: debhelper-compat (= 13),
python3-fsspec,
python3-h5netcdf <!nocheck>,
python3-h5py,
+ python3-hdf5plugin,
python3-netcdf4 <!nocheck>,
python3-numcodecs,
python3-numpy,
@@ -75,12 +78,15 @@ Package: python3-kerchunk
Architecture: all
Depends: ${python3:Depends},
${misc:Depends}
-Recommends: python3-cfgrib,
+Recommends: hdf5-filter-plugin,
+ python3-cfgrib,
python3-cftime,
python3-h5py,
+ python3-hdf5plugin,
python3-scipy,
python3-xarray
-Suggests: python3-aiohttp,
+Suggests: bitshuffle,
+ python3-aiohttp,
python3-dask,
python3-netcdf4,
python3-s3fs
=====================================
debian/patches/0001-Fix-privacy-breaches.patch
=====================================
@@ -7,16 +7,17 @@ Forwarded: not-needed
docs/source/advanced.rst | 5 -----
docs/source/beyond.rst | 5 -----
docs/source/cases.rst | 5 -----
+ docs/source/code-of-conduct.rst | 5 -----
docs/source/contributing.rst | 5 -----
docs/source/detail.rst | 5 -----
- docs/source/index.rst | 5 -----
+ docs/source/index.rst | 9 ---------
docs/source/nonzarr.rst | 5 -----
docs/source/reference.rst | 5 -----
docs/source/reference_aggregation.rst | 5 -----
docs/source/spec.rst | 4 ----
docs/source/test_example.rst | 5 -----
docs/source/tutorial.rst | 5 -----
- 12 files changed, 59 deletions(-)
+ 13 files changed, 68 deletions(-)
diff --git a/docs/source/advanced.rst b/docs/source/advanced.rst
index 914920a..1df873e 100644
@@ -59,6 +60,19 @@ index 19fa6e9..1abf2ab 100644
-
- <script data-goatcounter="https://kerchunk.goatcounter.com/count"
- async src="//gc.zgo.at/count.js"></script>
+diff --git a/docs/source/code-of-conduct.rst b/docs/source/code-of-conduct.rst
+index 502c098..6ff7c3b 100644
+--- a/docs/source/code-of-conduct.rst
++++ b/docs/source/code-of-conduct.rst
+@@ -119,8 +119,3 @@ the `Contributor Covenant`_ and the `Django`_ project.
+ .. _BeeWare: https://beeware.org/community/behavior/code-of-conduct/
+ .. _Contributor Covenant: https://www.contributor-covenant.org/version/1/3/0/code-of-conduct/
+ .. _Django: https://www.djangoproject.com/conduct/reporting/
+-
+-.. raw:: html
+-
+- <script data-goatcounter="https://kerchunk.goatcounter.com/count"
+- async src="//gc.zgo.at/count.js"></script>
diff --git a/docs/source/contributing.rst b/docs/source/contributing.rst
index 725b131..730b326 100644
--- a/docs/source/contributing.rst
@@ -86,14 +100,18 @@ index 02885b6..c5147fd 100644
- <script data-goatcounter="https://kerchunk.goatcounter.com/count"
- async src="//gc.zgo.at/count.js"></script>
diff --git a/docs/source/index.rst b/docs/source/index.rst
-index 15a2589..f28562f 100644
+index 0b7044e..2bbb0a4 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
-@@ -78,8 +78,3 @@ Indices and tables
+@@ -79,12 +79,3 @@ Indices and tables
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
-
+-These docs pages collect anonymous tracking data using goatcounter, and the
+-dashboard is available to the public: https://kerchunk.goatcounter.com/ .
+-
+-
-.. raw:: html
-
- <script data-goatcounter="https://kerchunk.goatcounter.com/count"
=====================================
debian/patches/0002-No-internet.patch
=====================================
@@ -67,7 +67,7 @@ index 2c5387f..add7b5a 100644
"idx_url, storage_options",
[
diff --git a/tests/test_hdf.py b/tests/test_hdf.py
-index b40cdc0..a3b14bd 100644
+index afaf6be..5c7b5ca 100644
--- a/tests/test_hdf.py
+++ b/tests/test_hdf.py
@@ -21,6 +21,7 @@ from kerchunk.utils import fs_as_store, refs_as_fs, refs_as_store
=====================================
debian/rules
=====================================
@@ -8,16 +8,14 @@ export PYBUILD_NAME=kerchunk
SKIP_ARGS=not test_single_append_parquet \
and not test_zarr_combine
-# tests requiring hdf5 plugins
-# See https://github.com/fsspec/kerchunk/pull/575
+# tests requiring broken or unavailable hdf5 plugins
SKIP_ARGS +=\
-and not test_string_embed \
-and not test_string_pathlib \
and not test_string_null \
and not test_string_decode \
and not test_compound_string_null \
and not test_compound_string_encode \
-and not test_var
+and not test_var \
+and not test_malicious_chunks
export PYBUILD_TEST_ARGS=-v -m "not remotedata" -k "${SKIP_ARGS}" $(CURDIR)/tests
export PYBUILD_AFTER_TEST=${RM} -r {dir}/tests/__pycache__
=====================================
docs/source/code-of-conduct.rst
=====================================
@@ -0,0 +1,126 @@
+Code of Conduct
+===============
+
+All participants in the fsspec community are expected to adhere to a Code of Conduct.
+
+As contributors and maintainers of this project, and in the interest of
+fostering an open and welcoming community, we pledge to respect all people who
+contribute through reporting issues, posting feature requests, updating
+documentation, submitting pull requests or patches, and other activities.
+
+We are committed to making participation in this project a harassment-free
+experience for everyone, treating everyone as unique humans deserving of
+respect.
+
+Examples of unacceptable behaviour by participants include:
+
+- The use of sexualized language or imagery
+- Personal attacks
+- Trolling or insulting/derogatory comments
+- Public or private harassment
+- Publishing others' private information, such as physical or electronic
+ addresses, without explicit permission
+- Other unethical or unprofessional conduct
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviours that they deem inappropriate,
+threatening, offensive, or harmful.
+
+By adopting this Code of Conduct, project maintainers commit themselves
+to fairly and consistently applying these principles to every aspect of
+managing this project. Project maintainers who do not follow or enforce
+the Code of Conduct may be permanently removed from the project team.
+
+This code of conduct applies both within project spaces and in public
+spaces when an individual is representing the project or its community.
+
+If you feel the code of conduct has been violated, please report the
+incident to the fsspec core team.
+
+Reporting
+---------
+
+If you believe someone is violating the Code of Conduct, we ask that you report it
+to the Project by emailing community at anaconda.com. All reports will be kept
+confidential. In some cases we may determine that a public statement will need
+to be made. If that's the case, the identities of all victims and reporters
+will remain confidential unless those individuals instruct us otherwise.
+If you believe anyone is in physical danger, please notify appropriate law
+enforcement first.
+
+In your report please include:
+
+- Your contact info
+- Names (real, nicknames, or pseudonyms) of any individuals involved.
+ If there were other witnesses besides you, please try to include them as well.
+- When and where the incident occurred. Please be as specific as possible.
+- Your account of what occurred. If there is a publicly available record
+ please include a link.
+- Any extra context you believe existed for the incident.
+- If you believe this incident is ongoing.
+- If you believe any member of the core team has a conflict of interest
+ in adjudicating the incident.
+- What, if any, corrective response you believe would be appropriate.
+- Any other information you believe we should have.
+
+Core team members are obligated to maintain confidentiality with regard
+to the reporter and details of an incident.
+
+What happens next?
+~~~~~~~~~~~~~~~~~~
+
+You will receive an email acknowledging receipt of your complaint.
+The core team will immediately meet to review the incident and determine:
+
+- What happened.
+- Whether this event constitutes a code of conduct violation.
+- Who the bad actor was.
+- Whether this is an ongoing situation, or if there is a threat to anyone's
+ physical safety.
+- If this is determined to be an ongoing incident or a threat to physical safety,
+ the working group's immediate priority will be to protect everyone involved.
+
+If a member of the core team is one of the named parties, they will not be
+included in any discussions, and will not be provided with any confidential
+details from the reporter.
+
+If anyone on the core team believes they have a conflict of interest in
+adjudicating on a reported issue, they will inform the other core team
+members, and exempt themselves from any discussion about the issue.
+Following this declaration, they will not be provided with any confidential
+details from the reporter.
+
+Once the working group has a complete account of the events, they will make a
+decision as to how to respond. Responses may include:
+
+- Nothing (if we determine no violation occurred).
+- A private reprimand from the working group to the individual(s) involved.
+- A public reprimand.
+- An imposed vacation.
+- A permanent or temporary ban from some or all spaces (GitHub repositories, etc.)
+- A request for a public or private apology.
+
+We'll respond within one week to the person who filed the report with either a
+resolution or an explanation of why the situation is not yet resolved.
+
+Once we've determined our final action, we'll contact the original reporter
+to let them know what action (if any) we'll be taking. We'll take into account
+feedback from the reporter on the appropriateness of our response, but we
+don't guarantee we'll act on it.
+
+Acknowledgement
+---------------
+
+This CoC is modified from the one by `BeeWare`_, which in turn refers to
+the `Contributor Covenant`_ and the `Django`_ project.
+
+.. _BeeWare: https://beeware.org/community/behavior/code-of-conduct/
+.. _Contributor Covenant: https://www.contributor-covenant.org/version/1/3/0/code-of-conduct/
+.. _Django: https://www.djangoproject.com/conduct/reporting/
+
+.. raw:: html
+
+ <script data-goatcounter="https://kerchunk.goatcounter.com/count"
+ async src="//gc.zgo.at/count.js"></script>
=====================================
docs/source/index.rst
=====================================
@@ -71,6 +71,7 @@ so that blocks from one or more files can be arranged into aggregate datasets ac
reference_aggregation
contributing
advanced
+ code-of-conduct
Indices and tables
==================
@@ -79,6 +80,10 @@ Indices and tables
* :ref:`modindex`
* :ref:`search`
+These docs pages collect anonymous tracking data using goatcounter, and the
+dashboard is available to the public: https://kerchunk.goatcounter.com/ .
+
+
.. raw:: html
<script data-goatcounter="https://kerchunk.goatcounter.com/count"
=====================================
kerchunk/grib2.py
=====================================
@@ -120,6 +120,22 @@ def _store_array(store, z, data, var, inline_threshold, offset, size, attr):
d.attrs.update(attr)
+def contains_valid_level(message_keys: Set) -> bool:
+ """Check if the given set of message_keys contain a valid level value.
+ Some types of level, like depthBelowLandLayer for GEFS grib files,
+ represent slices of levels by "topLevel" and "bottomLevel" rather
+ than a discrete level value described by "level".
+ See https://github.com/fsspec/kerchunk/issues/559
+
+ Args:
+ message_keys: Set of keys to evaluate
+
+ Returns:
+ True if message_keys contains a valid level value, False otherwise
+ """
+ return "level" in message_keys or "topLevel" in message_keys
+
+
def scan_grib(
url,
common=None,
@@ -247,7 +263,7 @@ def scan_grib(
_store_array(
store_dict, z, vals, varName, inline_threshold, offset, size, attrs
)
- if "typeOfLevel" in message_keys and "level" in message_keys:
+ if "typeOfLevel" in message_keys and contains_valid_level(message_keys):
name = m["typeOfLevel"]
coordinates.append(name)
# convert to numpy scalar, so that .tobytes can be used for inlining
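The level-key check added to `scan_grib` above can be exercised standalone; a minimal sketch using plain Python sets, mirroring `contains_valid_level`:

```python
# Some level types (e.g. depthBelowLandLayer in GEFS GRIB files) describe
# a layer via "topLevel"/"bottomLevel" instead of a discrete "level"
# value; the check accepts either form.
def contains_valid_level(message_keys: set) -> bool:
    return "level" in message_keys or "topLevel" in message_keys

print(contains_valid_level({"typeOfLevel", "level"}))                    # True
print(contains_valid_level({"typeOfLevel", "topLevel", "bottomLevel"}))  # True
print(contains_valid_level({"typeOfLevel"}))                             # False
```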
=====================================
kerchunk/hdf.py
=====================================
@@ -1,15 +1,15 @@
-import base64
import io
import logging
import pathlib
-from typing import Union, BinaryIO
+from contextlib import ExitStack
+from typing import Union, Any, Dict, List, Tuple
import fsspec.core
from fsspec.implementations.reference import LazyReferenceMapper
import numpy as np
import zarr
import numcodecs
-
+from numcodecs.abc import Codec
from .codecs import FillStringsCodec
from .utils import (
_encode_for_JSON,
@@ -27,6 +27,11 @@ except ModuleNotFoundError: # pragma: no cover
"for more details."
)
+try:
+ import hdf5plugin # noqa
+except ModuleNotFoundError:
+ hdf5plugin = None
+
lggr = logging.getLogger("h5-to-zarr")
_HIDDEN_ATTRS = { # from h5netcdf.attrs
"REFERENCE_LIST",
@@ -40,12 +45,19 @@ _HIDDEN_ATTRS = { # from h5netcdf.attrs
}
+class Hdf5FeatureNotSupported(RuntimeError):
+ pass
+
+
class SingleHdf5ToZarr:
"""Translate the content of one HDF5 file into Zarr metadata.
HDF5 groups become Zarr groups. HDF5 datasets become Zarr arrays. Zarr array
chunks remain in the HDF5 file.
+ It is a good idea to call `.close()` on this instance after use, to free up
+ the underlying file and HDF5 objects.
+
Parameters
----------
h5f : file-like or str
@@ -75,11 +87,14 @@ class SingleHdf5ToZarr:
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper
to write out parquet as the references get filled, or some other dictionary-like class
to customise how references get stored
+ unsupported_inline_threshold: int or None
+ Chunks that use unsupported HDF5 features and are smaller than this value are
+ included directly in the output. If None, the value of "inline_threshold" is used.
"""
def __init__(
self,
- h5f: "BinaryIO | str | h5py.File | h5py.Group",
+ h5f: "io.BinaryIO | str | h5py.File | h5py.Group",
url: str = None,
spec=1,
inline_threshold=500,
@@ -87,23 +102,29 @@ class SingleHdf5ToZarr:
error="warn",
vlen_encode="embed",
out=None,
+ unsupported_inline_threshold=None,
):
# Open HDF5 file in read mode...
lggr.debug(f"HDF5 file: {h5f}")
+ self._closers = ExitStack()
if isinstance(h5f, (pathlib.Path, str)):
fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
- self.input_file = fs.open(path, "rb")
+ self.input_file = self._closers.enter_context(fs.open(path, "rb"))
url = h5f
- self._h5f = h5py.File(self.input_file, mode="r")
+ self._h5f = self._closers.enter_context(
+ h5py.File(self.input_file, mode="r")
+ )
elif isinstance(h5f, io.IOBase):
self.input_file = h5f
- self._h5f = h5py.File(self.input_file, mode="r")
+ self._h5f = self._closers.enter_context(
+ h5py.File(self.input_file, mode="r")
+ )
elif isinstance(h5f, (h5py.File, h5py.Group)):
# assume h5py object (File or group/dataset)
self._h5f = h5f
fs, path = fsspec.core.url_to_fs(url, **(storage_options or {}))
- self.input_file = fs.open(path, "rb")
+ self.input_file = self._closers.enter_context(fs.open(path, "rb"))
else:
raise ValueError("type of input `h5f` not recognised")
self.spec = spec
@@ -116,8 +137,14 @@ class SingleHdf5ToZarr:
self._zroot = zarr.group(store=self.store, zarr_format=2)
self._uri = url
self.error = error
+ if unsupported_inline_threshold is None:
+ unsupported_inline_threshold = inline_threshold or 100
+ self.unsupported_inline_threshold = unsupported_inline_threshold
lggr.debug(f"HDF5 file URI: {self._uri}")
+ def close(self):
+ self._closers.close()
+
def translate(self, preserve_linked_dsets=False):
"""Translate content of one HDF5 file into Zarr storage format.
@@ -144,7 +171,7 @@ class SingleHdf5ToZarr:
if preserve_linked_dsets:
if not has_visititems_links():
- raise RuntimeError(
+ raise Hdf5FeatureNotSupported(
"'preserve_linked_dsets' kwarg requires h5py 3.11.0 or later "
f"is installed, found {h5py.__version__}"
)
@@ -219,11 +246,11 @@ class SingleHdf5ToZarr:
def _decode_filters(self, h5obj: Union[h5py.Dataset, h5py.Group]):
if h5obj.scaleoffset:
- raise RuntimeError(
+ raise Hdf5FeatureNotSupported(
f"{h5obj.name} uses HDF5 scaleoffset filter - not supported by kerchunk"
)
if h5obj.compression in ("szip", "lzf"):
- raise RuntimeError(
+ raise Hdf5FeatureNotSupported(
f"{h5obj.name} uses szip or lzf compression - not supported by kerchunk"
)
filters = []
@@ -261,18 +288,18 @@ class SingleHdf5ToZarr:
elif str(filter_id) == "gzip":
filters.append(numcodecs.Zlib(level=properties))
elif str(filter_id) == "32004":
- raise RuntimeError(
+ raise Hdf5FeatureNotSupported(
f"{h5obj.name} uses lz4 compression - not supported by kerchunk"
)
elif str(filter_id) == "32008":
- raise RuntimeError(
+ raise Hdf5FeatureNotSupported(
f"{h5obj.name} uses bitshuffle compression - not supported by kerchunk"
)
elif str(filter_id) == "shuffle":
# already handled before this loop
pass
else:
- raise RuntimeError(
+ raise Hdf5FeatureNotSupported(
f"{h5obj.name} uses filter id {filter_id} with properties {properties},"
f" not supported by kerchunk."
)
@@ -287,7 +314,7 @@ class SingleHdf5ToZarr:
):
"""Produce Zarr metadata for all groups and datasets in the HDF5 file."""
try: # method must not raise exception
- kwargs = {"compressor": None}
+ kwargs: Dict[str, Any] = {"compressor": None}
if isinstance(h5obj, (h5py.SoftLink, h5py.HardLink)):
h5obj = self._h5f[name]
@@ -303,13 +330,35 @@ class SingleHdf5ToZarr:
lggr.debug(f"HDF5 compression: {h5obj.compression}")
if h5obj.id.get_create_plist().get_layout() == h5py.h5d.COMPACT:
# Only do if h5obj.nbytes < self.inline??
- kwargs["data"] = h5obj[:]
+ # kwargs["data"] = h5obj[:]
+ if h5obj.nbytes < self.inline:
+ kwargs["data"] = h5obj[:]
kwargs["filters"] = []
else:
- kwargs["filters"] = self._decode_filters(h5obj)
+ try:
+ kwargs["filters"] = self._decode_filters(h5obj)
+ except Hdf5FeatureNotSupported:
+ if h5obj.nbytes < self.unsupported_inline_threshold:
+ kwargs["data"] = _read_unsupported_direct(h5obj)
+ kwargs["filters"] = []
+ else:
+ raise
dt = None
# Get storage info of this HDF5 dataset...
- cinfo = self._storage_info(h5obj)
+ if "data" in kwargs:
+ cinfo = _NULL_CHUNK_INFO
+ else:
+ try:
+ cinfo = self._storage_info_and_adj_filters(
+ h5obj, kwargs["filters"]
+ )
+ except Hdf5FeatureNotSupported:
+ if h5obj.nbytes < self.unsupported_inline_threshold:
+ kwargs["data"] = _read_unsupported_direct(h5obj)
+ kwargs["filters"] = []
+ cinfo = _NULL_CHUNK_INFO
+ else:
+ raise
if "data" in kwargs:
fill = None
@@ -340,7 +389,7 @@ class SingleHdf5ToZarr:
for v in val
]
kwargs["data"] = out
- kwargs["filters"] = [numcodecs.JSON()]
+ kwargs["filters"] = [numcodecs.VLenUTF8()]
fill = None
elif self.vlen == "null":
dt = "O"
@@ -466,7 +515,7 @@ class SingleHdf5ToZarr:
)
dt = "O"
kwargs["data"] = data2
- kwargs["filters"] = [numcodecs.JSON()]
+ kwargs["filters"] = [numcodecs.VLenUTF8()]
fill = None
else:
raise NotImplementedError
@@ -481,6 +530,8 @@ class SingleHdf5ToZarr:
# Create a Zarr array equivalent to this HDF5 dataset.
data = kwargs.pop("data", None)
+ if (dt or h5obj.dtype) == object:
+ dt = "string"
za = self._zroot.require_array(
name=h5obj.name.lstrip("/"),
shape=h5obj.shape,
@@ -501,14 +552,23 @@ class SingleHdf5ToZarr:
try:
za[:] = data
except (ValueError, TypeError):
- self.store_dict[f"{za.path}/0"] = kwargs["filters"][0].encode(
- data
+ store_key = f"{za.path}/{'.'.join('0' * h5obj.ndim)}"
+ self.store_dict[store_key] = _filters_encode_data(
+ data, kwargs["filters"]
)
return
# Store chunk location metadata...
if cinfo:
for k, v in cinfo.items():
+ key = (
+ str.removeprefix(h5obj.name, "/")
+ + "/"
+ + ".".join(map(str, k))
+ )
+ if "data" in v:
+ self.store_dict[key] = v["data"]
+ continue
if h5obj.fletcher32:
logging.info("Discarding fletcher32 checksum")
v["size"] -= 4
@@ -525,11 +585,12 @@ class SingleHdf5ToZarr:
):
self.input_file.seek(v["offset"])
data = self.input_file.read(v["size"])
- try:
- # easiest way to test if data is ascii
- data.decode("ascii")
- except UnicodeDecodeError:
- data = b"base64:" + base64.b64encode(data)
+ # Removed the encoding here, as the data will finally be encoded by _encode_for_JSON
+ # try:
+ # # easiest way to test if data is ascii
+ # data.decode("ascii")
+ # except UnicodeDecodeError:
+ # data = b"base64:" + base64.b64encode(data)
self.store_dict[key] = data
else:
@@ -597,7 +658,7 @@ class SingleHdf5ToZarr:
elif h5py.h5ds.is_scale(dset.id):
dims.append(dset.name[1:])
elif num_scales > 1:
- raise RuntimeError(
+ raise Hdf5FeatureNotSupported(
f"{dset.name}: {len(dset.dims[n])} "
f"dimension scales attached to dimension #{n}"
)
@@ -610,34 +671,42 @@ class SingleHdf5ToZarr:
dims.append(f"phony_dim_{n}")
return dims
- def _storage_info(self, dset: h5py.Dataset) -> dict:
+ def _storage_info_and_adj_filters(self, dset: h5py.Dataset, filters: list) -> dict:
"""Get storage information of an HDF5 dataset in the HDF5 file.
Storage information consists of file offset and size (length) for every
- chunk of the HDF5 dataset.
+ chunk of the HDF5 dataset. The HDF5 dataset also records, for each chunk,
+ which filters were skipped via `filter_mask` (mostly when a chunk is small,
+ or was written directly with a low-level API like `H5Dwrite_chunk`/`DatasetID.write_direct_chunk`),
+ hence a filter will be cleared if the first chunk does not apply it.
+ `Hdf5FeatureNotSupported` will be raised if chunks have a heterogeneous `filter_mask`
+ and the alien chunks cannot be inlined.
Parameters
----------
dset : h5py.Dataset
HDF5 dataset for which to collect storage information.
+ filters: list
+ List of filters to apply to the HDF5 dataset. Will be modified in place
+ if some filters are not applied.
Returns
-------
dict
HDF5 dataset storage information. Dict keys are chunk array offsets
as tuples. Dict values are pairs with chunk file offset and size
- integers.
+ integers, or the chunk data itself if it had to be inlined.
"""
# Empty (null) dataset...
if dset.shape is None:
- return dict()
+ return _NULL_CHUNK_INFO
dsid = dset.id
if dset.chunks is None:
# Contiguous dataset...
if dsid.get_offset() is None:
# No data ever written...
- return dict()
+ return _NULL_CHUNK_INFO
else:
key = (0,) * (len(dset.shape) or 1)
return {
@@ -648,16 +717,59 @@ class SingleHdf5ToZarr:
num_chunks = dsid.get_num_chunks()
if num_chunks == 0:
# No data ever written...
- return dict()
+ return _NULL_CHUNK_INFO
# Go over all the dataset chunks...
stinfo = dict()
chunk_size = dset.chunks
+ filter_mask = None # type: None | int
def get_key(blob):
return tuple([a // b for a, b in zip(blob.chunk_offset, chunk_size)])
+ def filter_filters_by_mask(
+ filter_list: List[Codec], filter_mask_
+ ) -> List[Codec]:
+ # use [2:] to remove the leading `0b` and [::-1] to reverse the order
+ bin_mask = bin(filter_mask_)[2:][::-1]
+ filters_rest = [
+ ifilter
+ for ifilter, imask in zip(filter_list, bin_mask)
+ if imask == "0"
+ ]
+ return filters_rest
+
def store_chunk_info(blob):
+ nonlocal filter_mask
+ if filter_mask is None:
+ filter_mask = blob.filter_mask
+ elif filter_mask != blob.filter_mask:
+ if blob.size < self.unsupported_inline_threshold:
+ data_slc = tuple(
+ slice(dim_offset, min(dim_offset + dim_chunk, dim_bound))
+ for dim_offset, dim_chunk, dim_bound in zip(
+ blob.chunk_offset, chunk_size, dset.shape
+ )
+ )
+ data = _read_unsupported_direct(dset, data_slc)
+ if data.shape != chunk_size:
+ bg = np.full(
+ chunk_size, dset.fillvalue, dset.dtype, order="C"
+ )
+ bg[tuple(slice(0, d) for d in data.shape)] = data
+ data_flatten = bg.reshape(-1)
+ else:
+ data_flatten = data.reshape(-1)
+ encoded = _filters_encode_data(
+ data_flatten, filter_filters_by_mask(filters, filter_mask)
+ )
+ stinfo[get_key(blob)] = {"data": encoded}
+ return
+ else:
+ raise Hdf5FeatureNotSupported(
+ f"Dataset {dset.name} has heterogeneous `filter_mask` - "
+ f"not supported by kerchunk."
+ )
stinfo[get_key(blob)] = {"offset": blob.byte_offset, "size": blob.size}
has_chunk_iter = callable(getattr(dsid, "chunk_iter", None))
@@ -668,9 +780,31 @@ class SingleHdf5ToZarr:
for index in range(num_chunks):
store_chunk_info(dsid.get_chunk_info(index))
+ # In most cases, the `filter_mask` should be zero, which means that all filters are applied.
+ # If a filter is skipped, the corresponding bit in the mask will be set to 1.
+ if filter_mask is not None and filter_mask != 0:
+ filters[:] = filter_filters_by_mask(filters, filter_mask)
return stinfo
+def _read_unsupported_direct(dset, slc: Union[slice, Tuple[slice, ...]] = slice(None)):
+ try:
+ return dset[slc]
+ # Reading may raise OSError/ValueError/RuntimeError when a required
+ # filter is not registered (an `OSError` with a confusing message has
+ # been observed without `hdf5plugin` imported). Simply catch all
+ # exceptions here, as we re-raise anyway.
+ except Exception:
+ if hdf5plugin is None:
+ import warnings
+
+ warnings.warn(
+ "Attempt to directly read h5-dataset via `h5py` failed. It is recommended to "
+ "install `hdf5plugin` so that we can register extra filters, and then try again."
+ )
+ raise
+
+
def _simple_type(x):
if isinstance(x, bytes):
return x.decode()
@@ -704,3 +838,12 @@ def _is_netcdf_variable(dataset: h5py.Dataset):
def has_visititems_links():
return hasattr(h5py.Group, "visititems_links")
+
+
+def _filters_encode_data(data, filters: List[Codec]):
+ for ifilter in filters:
+ data = ifilter.encode(data)
+ return bytes(data)
+
+
+_NULL_CHUNK_INFO = dict()
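The per-chunk `filter_mask` handling above can be illustrated in isolation. This is a hedged sketch, not the in-tree helper: it additionally pads the reversed bit string to the pipeline length, so filters beyond the mask's bit-length count as applied.

```python
# Bit i of an HDF5 chunk's filter_mask is 1 when filter i in the pipeline
# was skipped for that chunk.  bin(mask)[2:] drops the leading "0b" and
# [::-1] puts the least-significant bit (first filter) first.
def filter_filters_by_mask(filter_list, filter_mask):
    bin_mask = bin(filter_mask)[2:][::-1]
    bin_mask = bin_mask.ljust(len(filter_list), "0")  # pad: missing bits = applied
    return [f for f, bit in zip(filter_list, bin_mask) if bit == "0"]

pipeline = ["shuffle", "zstd"]
print(filter_filters_by_mask(pipeline, 0b00))  # ['shuffle', 'zstd']
print(filter_filters_by_mask(pipeline, 0b01))  # ['zstd']    (shuffle skipped)
print(filter_filters_by_mask(pipeline, 0b10))  # ['shuffle'] (zstd skipped)
```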
=====================================
kerchunk/utils.py
=====================================
@@ -217,6 +217,14 @@ def encode_fill_value(v: Any, dtype: np.dtype, compressor: Any = None) -> Any:
# early out
if v is None:
return v
+
+ # Ensure that v is a numpy array of the appropriate dtype
+ v = np.asanyarray(v, dtype=dtype)
+ # If v is 1D and only has 1 element, squeeze it to 0-D ndarray.
+ # When v is not 0-D, calling int/float/bool on v will fail
+ # We still need an ndarray in case dtype is a date
+ v = v.squeeze()
+
if dtype.kind == "V" and dtype.hasobject:
if compressor is None:
raise ValueError("missing compressor for object array")
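The `asanyarray`/`squeeze` normalization added above matters because recent NumPy rejects `int()` on a one-element 1-D array; a small sketch:

```python
import numpy as np

# A fill value may arrive as a one-element sequence; int()/float() on a
# 1-D one-element array is deprecated (an error in NumPy >= 2.0), while
# a 0-D array converts cleanly.
v = np.asanyarray([42], dtype=np.dtype("i8"))
v = v.squeeze()        # shape (1,) -> shape ()
print(v.ndim, int(v))  # 0 42
```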
=====================================
tests/gen_hdf5.py
=====================================
@@ -13,5 +13,8 @@ compressors = dict(
for c in compressors:
f = h5py.File(f"hdf5_compression_{c}.h5", "w")
- f.create_dataset("data", data=numpy.arange(100), **compressors[c])
+ # Explicitly set int64, to stay compatible with NumPy 1.x on Windows
+ f.create_dataset(
+ "data", data=numpy.arange(100, dtype=numpy.int64), **compressors[c]
+ )
f.close()
=====================================
tests/gen_mini_hdf5.py
=====================================
@@ -0,0 +1,71 @@
+import numpy
+import h5py
+import hdf5plugin
+import ujson
+import numcodecs
+import zarr
+from fsspec.implementations.reference import ReferenceFileSystem
+
+
+def _dict_add(a, b):
+ c = a.copy()
+ c.update(b)
+ return c
+
+
+compressors = dict(
+ zstd=hdf5plugin.Zstd(),
+ shuffle_zstd=_dict_add(
+ dict(shuffle=True), hdf5plugin.Zstd()
+ ), # Test for two filters
+ blosc=hdf5plugin.Blosc(),
+ shuffle_blosc=_dict_add(dict(shuffle=True), hdf5plugin.Blosc()),
+ lz4=hdf5plugin.LZ4(),
+)
+
+for c in compressors:
+ f = h5py.File(f"hdf5_mini_{c}.h5", "w")
+ f.create_dataset("data", (3, 2), dtype=numpy.int32, **compressors[c]).write_direct(
+ numpy.arange(6, dtype=numpy.int32).reshape((3, 2))
+ )
+ f.close()
+
+f = h5py.File(f"hdf5_mali_chunk.h5", "w")
+f.create_dataset("data", (8,), dtype=numpy.int32, chunks=(7,), **hdf5plugin.Zstd())
+f["data"][0:7] = numpy.arange(7, dtype=numpy.int32)
+f["data"].id.write_direct_chunk((7,), numpy.array([7] + [0] * 6, dtype=numpy.int32), 1)
+print(f["data"][:])
+f.close()
+
+f = h5py.File(f"hdf5_mali_chunk2.h5", "w")
+f.create_dataset(
+ "data", (8,), dtype=numpy.int32, chunks=(3,), shuffle=True, **hdf5plugin.Zstd()
+)
+f["data"].id.write_direct_chunk(
+ (0,), numcodecs.Shuffle(4).encode(numpy.array([0, 1, 2], dtype=numpy.int32)), 2
+)
+f["data"].id.write_direct_chunk((3,), numpy.array([3, 4, 5], dtype=numpy.int32), 3)
+f["data"][6:] = numpy.array(
+ [
+ 6,
+ 7,
+ ],
+ dtype=numpy.int32,
+)
+print(f["data"][:])
+f.close()
+
+# from kerchunk.hdf import SingleHdf5ToZarr
+# for c in compressors:
+# # if c=='lz4':
+# # continue
+# with open(f'hdf5_mini_{c}.h5', 'rb') as f:
+# print(ujson.dumps(SingleHdf5ToZarr(f, None, inline_threshold=0, unsupported_inline_threshold=1).translate(), indent=4))
+# with open(f'hdf5_mali_chunk.h5', 'rb') as f:
+# ref = SingleHdf5ToZarr(f, None).translate()
+# print(ujson.dumps(ref, indent=4))
+# print(zarr.group(ReferenceFileSystem(ref, target=f'hdf5_mali_chunk.h5').get_mapper())['data'][:])
+# with open(f'hdf5_mali_chunk2.h5', 'rb') as f:
+# ref = SingleHdf5ToZarr(f, None).translate()
+# print(ujson.dumps(ref, indent=4))
+# print(zarr.group(ReferenceFileSystem(ref, target=f'hdf5_mali_chunk2.h5').get_mapper())['data'][:])
=====================================
tests/grib_idx_fixtures/gec00_20250101_00_soilw/gec00_20250101_00_soilw.grib
=====================================
Binary files /dev/null and b/tests/grib_idx_fixtures/gec00_20250101_00_soilw/gec00_20250101_00_soilw.grib differ
=====================================
tests/hdf5_mali_chunk.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mali_chunk.h5 differ
=====================================
tests/hdf5_mali_chunk2.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mali_chunk2.h5 differ
=====================================
tests/hdf5_mini_blosc.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_blosc.h5 differ
=====================================
tests/hdf5_mini_lz4.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_lz4.h5 differ
=====================================
tests/hdf5_mini_shuffle_blosc.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_shuffle_blosc.h5 differ
=====================================
tests/hdf5_mini_shuffle_zstd.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_shuffle_zstd.h5 differ
=====================================
tests/hdf5_mini_zstd.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_zstd.h5 differ
=====================================
tests/test_hdf.py
=====================================
@@ -2,7 +2,7 @@ import asyncio
import fsspec
import json
import os.path as osp
-
+import hdf5plugin
import zarr.core
import zarr.core.buffer
import zarr.core.group
@@ -195,7 +195,7 @@ txt = "the change of water into water vapour"
def test_string_embed():
fn = osp.join(here, "vlen.h5")
- h = kerchunk.hdf.SingleHdf5ToZarr(fn, fn, vlen_encode="embed", error="pdb")
+ h = kerchunk.hdf.SingleHdf5ToZarr(fn, fn, vlen_encode="embed")
out = h.translate()
localfs = AsyncFileSystemWrapper(fsspec.filesystem("file"))
@@ -203,7 +203,7 @@ def test_string_embed():
# assert txt in fs.references["vlen_str/0"]
store = fs_as_store(fs)
z = zarr.open(store, zarr_format=2)
- assert z["vlen_str"].dtype == "O"
+ assert "String" in str(z["vlen_str"])
assert z["vlen_str"][0] == txt
assert (z["vlen_str"][1:] == "").all()
@@ -218,7 +218,7 @@ def test_string_pathlib():
fs = fsspec.filesystem("reference", fo=out)
assert txt in fs.references["vlen_str/0"]
z = zarr.open(fs_as_store(fs))
- assert z["vlen_str"].dtype == "O"
+ assert "String" in str(z["vlen_str"])
assert z["vlen_str"][0] == txt
assert (z["vlen_str"][1:] == "").all()
@@ -299,9 +299,10 @@ def test_compound_string_encode():
fn = osp.join(here, "vlen2.h5")
with open(fn, "rb") as f:
h = kerchunk.hdf.SingleHdf5ToZarr(
- f, fn, vlen_encode="encode", inline_threshold=0
+ f, fn, vlen_encode="encode", inline_threshold=0, error="raise"
)
out = h.translate()
+ print(json.dumps(out, indent=4))
localfs = AsyncFileSystemWrapper(fsspec.filesystem("file"))
store = refs_as_store(out, fs=localfs)
z = zarr.open(store, zarr_format=2)
@@ -328,7 +329,9 @@ def test_compress():
files = glob.glob(osp.join(here, "hdf5_compression_*.h5"))
for f in files:
- h = kerchunk.hdf.SingleHdf5ToZarr(f, error="raise")
+ h = kerchunk.hdf.SingleHdf5ToZarr(
+ f, error="raise", unsupported_inline_threshold=800
+ )
if "compression_lz4" in f or "compression_bitshuffle" in f:
with pytest.raises(RuntimeError):
h.translate()
@@ -339,6 +342,17 @@ def test_compress():
g = zarr.open(store, zarr_format=2)
assert np.mean(g["data"]) == 49.5
+ for f in files:
+ # Small chunks with unsupported compression can now be inlined
+ h = kerchunk.hdf.SingleHdf5ToZarr(
+ f, error="raise", unsupported_inline_threshold=801
+ )
+ out = h.translate()
+ localfs = AsyncFileSystemWrapper(fsspec.filesystem("file"))
+ store = refs_as_store(out, fs=localfs)
+ g = zarr.open(store, zarr_format=2)
+ assert np.mean(g["data"]) == 49.5
+
# def test_embed():
# fn = osp.join(here, "NEONDSTowerTemperatureData.hdf5")
@@ -393,3 +407,44 @@ def test_translate_links():
for key in z[f"{dset}_{link}"].attrs.keys():
if key not in kerchunk.hdf._HIDDEN_ATTRS and key != "_ARRAY_DIMENSIONS":
assert z[f"{dset}_{link}"].attrs[key] == z[dset].attrs[key]
+
+
+def test_small_chunks():
+ from fsspec.implementations.reference import ReferenceFileSystem
+
+ suffixes = ["blosc", "shuffle_zstd", "zstd", "lz4", "shuffle_blosc"]
+ for suffix in suffixes:
+ fname = osp.join(here, f"hdf5_mini_{suffix}.h5")
+ with open(fname, "rb") as f:
+ ref = kerchunk.hdf.SingleHdf5ToZarr(f, None).translate()
+ store = ReferenceFileSystem(ref, target=fname, asynchronous=True).get_mapper()
+ data = zarr.group(store)["data"][:]
+ assert (data == np.arange(6, dtype=np.int32).reshape((3, 2))).all()
+
+ if suffix == "lz4":
+ with pytest.raises(RuntimeError):
+ with open(fname, "rb") as f:
+ ref = kerchunk.hdf.SingleHdf5ToZarr(
+ f, None, unsupported_inline_threshold=0, error="raise"
+ ).translate()
+ continue
+ with open(fname, "rb") as f:
+ ref = kerchunk.hdf.SingleHdf5ToZarr(
+ f, None, unsupported_inline_threshold=0
+ ).translate()
+ store2 = ReferenceFileSystem(ref, target=fname).get_mapper()
+ data2 = zarr.group(store2)["data"][:]
+ assert (data2 == np.arange(6, dtype=np.int32).reshape((3, 2))).all()
+
+
+def test_malicious_chunks():
+ from fsspec.implementations.reference import ReferenceFileSystem
+
+ suffixes = ["", "2"]
+ for suffix in suffixes:
+ fname = osp.join(here, f"hdf5_mali_chunk{suffix}.h5")
+ with open(fname, "rb") as f:
+ ref = kerchunk.hdf.SingleHdf5ToZarr(f, None).translate()
+ store = ReferenceFileSystem(ref, target=fname, asynchronous=True).get_mapper()
+ data = zarr.group(store)["data"][:]
+ assert (data == np.arange(8, dtype=np.int32)).all()
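The `unsupported_inline_threshold` tests above rely on kerchunk embedding small chunks directly in the reference set (rather than recording a byte range into the HDF5 file), so the unsupported codec never has to run at read time. A minimal illustration of such an inline reference, decoded by fsspec's reference filesystem, might look like this (example values are made up, not taken from the patch):

```python
import fsspec
import numpy as np

# A version-1 reference set with one inlined chunk. Values prefixed with
# "base64:" are decoded to raw bytes by the reference filesystem on read.
refs = {
    "version": 1,
    "refs": {
        # base64 of three little-endian int32 values: 0, 1, 2
        "data/0": "base64:AAAAAAEAAAACAAAA",
    },
}
fs = fsspec.filesystem("reference", fo=refs)
chunk = np.frombuffer(fs.cat("data/0"), dtype="<i4")
print(chunk.tolist())  # → [0, 1, 2]
```

Because the bytes live inside the reference JSON itself, no filter plugin is needed on the reading side, which is what makes inlining a workaround for codecs kerchunk cannot translate.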
=====================================
tests/test_utils.py
=====================================
@@ -176,3 +176,10 @@ def test_deflate_zip_archive(m):
fs = fsspec.filesystem("reference", fo=refs2)
assert dec.decode(fs.cat("b")) == data
+
+
+def test_encode_fill_value():
+ assert kerchunk.utils.encode_fill_value(np.array([9999]), np.dtype("int")) == 9999
+ assert kerchunk.utils.encode_fill_value(np.array(9999), np.dtype("int")) == 9999
+ assert kerchunk.utils.encode_fill_value([9999], np.dtype("int")) == 9999
+ assert kerchunk.utils.encode_fill_value(9999, np.dtype("int")) == 9999
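The new `test_encode_fill_value` asserts that `kerchunk.utils.encode_fill_value` accepts a plain scalar, a list, a 0-d array, and a 1-d array and returns a JSON-serializable Python scalar in every case. The normalization being tested could be sketched roughly like this (`normalize_fill` is a hypothetical stand-in, not kerchunk's actual implementation):

```python
import numpy as np

def normalize_fill(value, dtype):
    # Hypothetical sketch: coerce any scalar-like input (int, list,
    # 0-d or 1-d array) to the target dtype, then return a plain
    # Python scalar suitable for embedding in reference JSON.
    arr = np.asarray(value, dtype=dtype).ravel()
    return arr[0].item()

for v in (9999, [9999], np.array(9999), np.array([9999])):
    print(normalize_fill(v, np.dtype("int")))  # → 9999 each time
```

Collapsing all the input shapes to one scalar representation is what lets the fill value round-trip through JSON regardless of how the HDF5 attribute happened to be stored.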
View it on GitLab: https://salsa.debian.org/debian-gis-team/kerchunk/-/compare/c9184d7745a7aa278981b33b56a10ee6ec026935...e5b46fd9cb0714dfcc3c68db09e3e7e9fdfd2bd6