[Git][debian-gis-team/kerchunk][upstream] New upstream version 0.2.10

Antonio Valentino (@antonio.valentino) gitlab at salsa.debian.org
Sat Apr 4 09:45:18 BST 2026



Antonio Valentino pushed to branch upstream at Debian GIS Project / kerchunk


Commits:
6aa2e938 by Antonio Valentino at 2026-04-03T17:07:03+00:00
New upstream version 0.2.10
- - - - -


19 changed files:

- ci/environment-py311.yml
- ci/environment-py312.yml
- + docs/source/code-of-conduct.rst
- docs/source/index.rst
- kerchunk/grib2.py
- kerchunk/hdf.py
- kerchunk/utils.py
- tests/gen_hdf5.py
- + tests/gen_mini_hdf5.py
- + tests/grib_idx_fixtures/gec00_20250101_00_soilw/gec00_20250101_00_soilw.grib
- + tests/hdf5_mali_chunk.h5
- + tests/hdf5_mali_chunk2.h5
- + tests/hdf5_mini_blosc.h5
- + tests/hdf5_mini_lz4.h5
- + tests/hdf5_mini_shuffle_blosc.h5
- + tests/hdf5_mini_shuffle_zstd.h5
- + tests/hdf5_mini_zstd.h5
- tests/test_hdf.py
- tests/test_utils.py


Changes:

=====================================
ci/environment-py311.yml
=====================================
@@ -9,6 +9,7 @@ dependencies:
   - xarray>=2024.10.0
   - h5netcdf
   - h5py
+  - hdf5plugin
   - pandas
   - cfgrib
   # Temporary workaround for #508


=====================================
ci/environment-py312.yml
=====================================
@@ -9,6 +9,7 @@ dependencies:
   - xarray>=2024.10.0
   - h5netcdf
   - h5py
+  - hdf5plugin
   - pandas
   - cfgrib
   # Temporary workaround for #508

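Note: importing hdf5plugin registers additional HDF5 compression filters
(Zstd, Blosc, LZ4, bitshuffle, ...) with the HDF5 library as a side effect,
which the new optional import in kerchunk/hdf.py below relies on. A minimal
sketch of the effect (file name hypothetical):

    import hdf5plugin  # registers the extra filters on import
    import h5py

    # A dataset written with e.g. hdf5plugin.Zstd() is now readable:
    with h5py.File("zstd_compressed.h5", "r") as f:
        data = f["data"][:]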

=====================================
docs/source/code-of-conduct.rst
=====================================
@@ -0,0 +1,126 @@
+Code of Conduct
+===============
+
+All participants in the fsspec community are expected to adhere to a Code of Conduct.
+
+As contributors and maintainers of this project, and in the interest of
+fostering an open and welcoming community, we pledge to respect all people who
+contribute through reporting issues, posting feature requests, updating
+documentation, submitting pull requests or patches, and other activities.
+
+We are committed to making participation in this project a harassment-free
+experience for everyone, treating everyone as unique humans deserving of
+respect.
+
+Examples of unacceptable behaviour by participants include:
+
+- The use of sexualized language or imagery
+- Personal attacks
+- Trolling or insulting/derogatory comments
+- Public or private harassment
+- Publishing others' private information, such as physical or electronic
+  addresses, without explicit permission
+- Other unethical or unprofessional conduct
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviours that they deem inappropriate,
+threatening, offensive, or harmful.
+
+By adopting this Code of Conduct, project maintainers commit themselves
+to fairly and consistently applying these principles to every aspect of
+managing this project. Project maintainers who do not follow or enforce
+the Code of Conduct may be permanently removed from the project team.
+
+This code of conduct applies both within project spaces and in public
+spaces when an individual is representing the project or its community.
+
+If you feel the code of conduct has been violated, please report the
+incident to the fsspec core team.
+
+Reporting
+---------
+
+If you believe someone is violating the Code of Conduct, we ask that you report it
+to the Project by emailing community at anaconda.com. All reports will be kept
+confidential. In some cases we may determine that a public statement will need
+to be made. If that's the case, the identities of all victims and reporters
+will remain confidential unless those individuals instruct us otherwise.
+If you believe anyone is in physical danger, please notify appropriate law
+enforcement first.
+
+In your report please include:
+
+- Your contact info
+- Names (real, nicknames, or pseudonyms) of any individuals involved.
+  If there were other witnesses besides you, please try to include them as well.
+- When and where the incident occurred. Please be as specific as possible.
+- Your account of what occurred. If there is a publicly available record
+  please include a link.
+- Any extra context you believe existed for the incident.
+- If you believe this incident is ongoing.
+- If you believe any member of the core team has a conflict of interest
+  in adjudicating the incident.
+- What, if any, corrective response you believe would be appropriate.
+- Any other information you believe we should have.
+
+Core team members are obligated to maintain confidentiality with regard
+to the reporter and details of an incident.
+
+What happens next?
+~~~~~~~~~~~~~~~~~~
+
+You will receive an email acknowledging receipt of your complaint.
+The core team will immediately meet to review the incident and determine:
+
+- What happened.
+- Whether this event constitutes a code of conduct violation.
+- Who the bad actor was.
+- Whether this is an ongoing situation, or if there is a threat to anyone's
+  physical safety.
+- If this is determined to be an ongoing incident or a threat to physical safety,
+  the working group's immediate priority will be to protect everyone involved.
+
+If a member of the core team is one of the named parties, they will not be
+included in any discussions, and will not be provided with any confidential
+details from the reporter.
+
+If anyone on the core team believes they have a conflict of interest in
+adjudicating on a reported issue, they will inform the other core team
+members, and exempt themselves from any discussion about the issue.
+Following this declaration, they will not be provided with any confidential
+details from the reporter.
+
+Once the working group has a complete account of the events, they will make a
+decision as to how to respond. Responses may include:
+
+- Nothing (if we determine no violation occurred).
+- A private reprimand from the working group to the individual(s) involved.
+- A public reprimand.
+- An imposed vacation.
+- A permanent or temporary ban from some or all spaces (GitHub repositories, etc.)
+- A request for a public or private apology.
+
+We'll respond within one week to the person who filed the report with either a
+resolution or an explanation of why the situation is not yet resolved.
+
+Once we've determined our final action, we'll contact the original reporter
+to let them know what action (if any) we'll be taking. We'll take into account
+feedback from the reporter on the appropriateness of our response, but we
+don't guarantee we'll act on it.
+
+Acknowledgement
+---------------
+
+This CoC is modified from the one by `BeeWare`_, which in turn refers to
+the `Contributor Covenant`_ and the `Django`_ project.
+
+.. _BeeWare: https://beeware.org/community/behavior/code-of-conduct/
+.. _Contributor Covenant: https://www.contributor-covenant.org/version/1/3/0/code-of-conduct/
+.. _Django: https://www.djangoproject.com/conduct/reporting/
+
+.. raw:: html
+
+    <script data-goatcounter="https://kerchunk.goatcounter.com/count"
+        async src="//gc.zgo.at/count.js"></script>


=====================================
docs/source/index.rst
=====================================
@@ -71,6 +71,7 @@ so that blocks from one or more files can be arranged into aggregate datasets ac
    reference_aggregation
    contributing
    advanced
+   code-of-conduct
 
 Indices and tables
 ==================
@@ -79,6 +80,10 @@ Indices and tables
 * :ref:`modindex`
 * :ref:`search`
 
+These documentation pages collect anonymous usage data via goatcounter; the
+dashboard is available to the public: https://kerchunk.goatcounter.com/ .
+
+
 .. raw:: html
 
     <script data-goatcounter="https://kerchunk.goatcounter.com/count"


=====================================
kerchunk/grib2.py
=====================================
@@ -120,6 +120,22 @@ def _store_array(store, z, data, var, inline_threshold, offset, size, attr):
     d.attrs.update(attr)
 
 
+def contains_valid_level(message_keys: Set) -> bool:
+    """Check if the given set of message_keys contain a valid level value.
+    Some types of level, like depthBelowLandLayer for GEFS grib files,
+    represent slices of levels by "topLevel" and "bottomLevel" rather
+    than a discrete level value described by "level".
+    see https://github.com/fsspec/kerchunk/issues/559
+
+    Args:
+        message_keys: Set of keys to evaluate
+
+    Returns:
+        True if message_keys contain a valid level value, False otherwise
+    """
+    return "level" in message_keys or "topLevel" in message_keys
+
+
 def scan_grib(
     url,
     common=None,
@@ -247,7 +263,7 @@ def scan_grib(
             _store_array(
                 store_dict, z, vals, varName, inline_threshold, offset, size, attrs
             )
-            if "typeOfLevel" in message_keys and "level" in message_keys:
+            if "typeOfLevel" in message_keys and contains_valid_level(message_keys):
                 name = m["typeOfLevel"]
                 coordinates.append(name)
                 # convert to numpy scalar, so that .tobytes can be used for inlining

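For illustration, how the relaxed check behaves on typical GRIB message key
sets (key sets abridged and hypothetical):

    # Discrete level, e.g. isobaricInhPa:
    contains_valid_level({"typeOfLevel", "level", "shortName"})       # True
    # Layer between two levels, e.g. depthBelowLandLayer in GEFS files:
    contains_valid_level({"typeOfLevel", "topLevel", "bottomLevel"})  # True
    # No level information at all:
    contains_valid_level({"shortName"})                               # False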

=====================================
kerchunk/hdf.py
=====================================
@@ -1,15 +1,15 @@
-import base64
 import io
 import logging
 import pathlib
-from typing import Union, BinaryIO
+from contextlib import ExitStack
+from typing import Union, Any, BinaryIO, Dict, List, Tuple
 
 import fsspec.core
 from fsspec.implementations.reference import LazyReferenceMapper
 import numpy as np
 import zarr
 import numcodecs
-
+from numcodecs.abc import Codec
 from .codecs import FillStringsCodec
 from .utils import (
     _encode_for_JSON,
@@ -27,6 +27,11 @@ except ModuleNotFoundError:  # pragma: no cover
         "for more details."
     )
 
+try:
+    import hdf5plugin  # noqa
+except ModuleNotFoundError:
+    hdf5plugin = None
+
 lggr = logging.getLogger("h5-to-zarr")
 _HIDDEN_ATTRS = {  # from h5netcdf.attrs
     "REFERENCE_LIST",
@@ -40,12 +45,19 @@ _HIDDEN_ATTRS = {  # from h5netcdf.attrs
 }
 
 
+class Hdf5FeatureNotSupported(RuntimeError):
+    pass
+
+
 class SingleHdf5ToZarr:
     """Translate the content of one HDF5 file into Zarr metadata.
 
     HDF5 groups become Zarr groups. HDF5 datasets become Zarr arrays. Zarr array
     chunks remain in the HDF5 file.
 
+    It is a good idea to call `.close()` on this instance after use, to free up
+    the underlying file and HDF5 resources.
+
     Parameters
     ----------
     h5f : file-like or str
@@ -75,11 +87,14 @@ class SingleHdf5ToZarr:
         This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper
         to write out parquet as the references get filled, or some other dictionary-like class
         to customise how references get stored
+    unsupported_inline_threshold: int or None
+        Chunks that use unsupported HDF5 features and are smaller than this value
+        are inlined directly in the output. None falls back to "inline_threshold".
     """
 
     def __init__(
         self,
-        h5f: "BinaryIO | str | h5py.File | h5py.Group",
+        h5f: "io.BinaryIO | str | h5py.File | h5py.Group",
         url: str = None,
         spec=1,
         inline_threshold=500,
@@ -87,23 +102,29 @@ class SingleHdf5ToZarr:
         error="warn",
         vlen_encode="embed",
         out=None,
+        unsupported_inline_threshold=None,
     ):
 
         # Open HDF5 file in read mode...
         lggr.debug(f"HDF5 file: {h5f}")
+        self._closers = ExitStack()
         if isinstance(h5f, (pathlib.Path, str)):
             fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
-            self.input_file = fs.open(path, "rb")
+            self.input_file = self._closers.enter_context(fs.open(path, "rb"))
             url = h5f
-            self._h5f = h5py.File(self.input_file, mode="r")
+            self._h5f = self._closers.enter_context(
+                h5py.File(self.input_file, mode="r")
+            )
         elif isinstance(h5f, io.IOBase):
             self.input_file = h5f
-            self._h5f = h5py.File(self.input_file, mode="r")
+            self._h5f = self._closers.enter_context(
+                h5py.File(self.input_file, mode="r")
+            )
         elif isinstance(h5f, (h5py.File, h5py.Group)):
             # assume h5py object (File or group/dataset)
             self._h5f = h5f
             fs, path = fsspec.core.url_to_fs(url, **(storage_options or {}))
-            self.input_file = fs.open(path, "rb")
+            self.input_file = self._closers.enter_context(fs.open(path, "rb"))
         else:
             raise ValueError("type of input `h5f` not recognised")
         self.spec = spec
@@ -116,8 +137,14 @@ class SingleHdf5ToZarr:
         self._zroot = zarr.group(store=self.store, zarr_format=2)
         self._uri = url
         self.error = error
+        if unsupported_inline_threshold is None:
+            unsupported_inline_threshold = inline_threshold or 100
+        self.unsupported_inline_threshold = unsupported_inline_threshold
         lggr.debug(f"HDF5 file URI: {self._uri}")
 
+    def close(self):
+        self._closers.close()
+
     def translate(self, preserve_linked_dsets=False):
         """Translate content of one HDF5 file into Zarr storage format.
 
@@ -144,7 +171,7 @@ class SingleHdf5ToZarr:
 
         if preserve_linked_dsets:
             if not has_visititems_links():
-                raise RuntimeError(
+                raise Hdf5FeatureNotSupported(
                     "'preserve_linked_dsets' kwarg requires h5py 3.11.0 or later "
                     f"is installed, found {h5py.__version__}"
                 )
@@ -219,11 +246,11 @@ class SingleHdf5ToZarr:
 
     def _decode_filters(self, h5obj: Union[h5py.Dataset, h5py.Group]):
         if h5obj.scaleoffset:
-            raise RuntimeError(
+            raise Hdf5FeatureNotSupported(
                 f"{h5obj.name} uses HDF5 scaleoffset filter - not supported by kerchunk"
             )
         if h5obj.compression in ("szip", "lzf"):
-            raise RuntimeError(
+            raise Hdf5FeatureNotSupported(
                 f"{h5obj.name} uses szip or lzf compression - not supported by kerchunk"
             )
         filters = []
@@ -261,18 +288,18 @@ class SingleHdf5ToZarr:
             elif str(filter_id) == "gzip":
                 filters.append(numcodecs.Zlib(level=properties))
             elif str(filter_id) == "32004":
-                raise RuntimeError(
+                raise Hdf5FeatureNotSupported(
                     f"{h5obj.name} uses lz4 compression - not supported by kerchunk"
                 )
             elif str(filter_id) == "32008":
-                raise RuntimeError(
+                raise Hdf5FeatureNotSupported(
                     f"{h5obj.name} uses bitshuffle compression - not supported by kerchunk"
                 )
             elif str(filter_id) == "shuffle":
                 # already handled before this loop
                 pass
             else:
-                raise RuntimeError(
+                raise Hdf5FeatureNotSupported(
                     f"{h5obj.name} uses filter id {filter_id} with properties {properties},"
                     f" not supported by kerchunk."
                 )
@@ -287,7 +314,7 @@ class SingleHdf5ToZarr:
     ):
         """Produce Zarr metadata for all groups and datasets in the HDF5 file."""
         try:  # method must not raise exception
-            kwargs = {"compressor": None}
+            kwargs: Dict[str, Any] = {"compressor": None}
 
             if isinstance(h5obj, (h5py.SoftLink, h5py.HardLink)):
                 h5obj = self._h5f[name]
@@ -303,13 +330,35 @@ class SingleHdf5ToZarr:
                 lggr.debug(f"HDF5 compression: {h5obj.compression}")
                 if h5obj.id.get_create_plist().get_layout() == h5py.h5d.COMPACT:
                     # Only do if h5obj.nbytes < self.inline??
-                    kwargs["data"] = h5obj[:]
+                    if h5obj.nbytes < self.inline:
+                        kwargs["data"] = h5obj[:]
                     kwargs["filters"] = []
                 else:
-                    kwargs["filters"] = self._decode_filters(h5obj)
+                    try:
+                        kwargs["filters"] = self._decode_filters(h5obj)
+                    except Hdf5FeatureNotSupported:
+                        if h5obj.nbytes < self.unsupported_inline_threshold:
+                            kwargs["data"] = _read_unsupported_direct(h5obj)
+                            kwargs["filters"] = []
+                        else:
+                            raise
                 dt = None
                 # Get storage info of this HDF5 dataset...
-                cinfo = self._storage_info(h5obj)
+                if "data" in kwargs:
+                    cinfo = _NULL_CHUNK_INFO
+                else:
+                    try:
+                        cinfo = self._storage_info_and_adj_filters(
+                            h5obj, kwargs["filters"]
+                        )
+                    except Hdf5FeatureNotSupported:
+                        if h5obj.nbytes < self.unsupported_inline_threshold:
+                            kwargs["data"] = _read_unsupported_direct(h5obj)
+                            kwargs["filters"] = []
+                            cinfo = _NULL_CHUNK_INFO
+                        else:
+                            raise
 
                 if "data" in kwargs:
                     fill = None
@@ -340,7 +389,7 @@ class SingleHdf5ToZarr:
                                             for v in val
                                         ]
                             kwargs["data"] = out
-                            kwargs["filters"] = [numcodecs.JSON()]
+                            kwargs["filters"] = [numcodecs.VLenUTF8()]
                             fill = None
                         elif self.vlen == "null":
                             dt = "O"
@@ -466,7 +515,7 @@ class SingleHdf5ToZarr:
                                 )
                             dt = "O"
                             kwargs["data"] = data2
-                            kwargs["filters"] = [numcodecs.JSON()]
+                            kwargs["filters"] = [numcodecs.VLenUTF8()]
                             fill = None
                         else:
                             raise NotImplementedError
@@ -481,6 +530,8 @@ class SingleHdf5ToZarr:
 
                 # Create a Zarr array equivalent to this HDF5 dataset.
                 data = kwargs.pop("data", None)
+                if (dt or h5obj.dtype) == object:
+                    dt = "string"
                 za = self._zroot.require_array(
                     name=h5obj.name.lstrip("/"),
                     shape=h5obj.shape,
@@ -501,14 +552,23 @@ class SingleHdf5ToZarr:
                     try:
                         za[:] = data
                     except (ValueError, TypeError):
-                        self.store_dict[f"{za.path}/0"] = kwargs["filters"][0].encode(
-                            data
+                        store_key = f"{za.path}/{'.'.join('0' * h5obj.ndim)}"
+                        self.store_dict[store_key] = _filters_encode_data(
+                            data, kwargs["filters"]
                         )
                     return
 
                 # Store chunk location metadata...
                 if cinfo:
                     for k, v in cinfo.items():
+                        key = (
+                            str.removeprefix(h5obj.name, "/")
+                            + "/"
+                            + ".".join(map(str, k))
+                        )
+                        if "data" in v:
+                            self.store_dict[key] = v["data"]
+                            continue
                         if h5obj.fletcher32:
                             logging.info("Discarding fletcher32 checksum")
                             v["size"] -= 4
@@ -525,11 +585,12 @@ class SingleHdf5ToZarr:
                         ):
                             self.input_file.seek(v["offset"])
                             data = self.input_file.read(v["size"])
-                            try:
-                                # easiest way to test if data is ascii
-                                data.decode("ascii")
-                            except UnicodeDecodeError:
-                                data = b"base64:" + base64.b64encode(data)
+                            # No ascii/base64 handling here: the data will be
+                            # encoded by _encode_for_JSON at output time.
 
                             self.store_dict[key] = data
                         else:
@@ -597,7 +658,7 @@ class SingleHdf5ToZarr:
                 elif h5py.h5ds.is_scale(dset.id):
                     dims.append(dset.name[1:])
                 elif num_scales > 1:
-                    raise RuntimeError(
+                    raise Hdf5FeatureNotSupported(
                         f"{dset.name}: {len(dset.dims[n])} "
                         f"dimension scales attached to dimension #{n}"
                     )
@@ -610,34 +671,42 @@ class SingleHdf5ToZarr:
                     dims.append(f"phony_dim_{n}")
         return dims
 
-    def _storage_info(self, dset: h5py.Dataset) -> dict:
+    def _storage_info_and_adj_filters(self, dset: h5py.Dataset, filters: list) -> dict:
         """Get storage information of an HDF5 dataset in the HDF5 file.
 
         Storage information consists of file offset and size (length) for every
-        chunk of the HDF5 dataset.
+        chunk of the HDF5 dataset. The HDF5 dataset also records, per chunk,
+        which filters were skipped via `filter_mask` (mostly when a chunk is
+        small, or was written directly with a low-level API such as
+        `H5Dwrite_chunk`/`DatasetID.write_direct_chunk`); a filter is therefore
+        cleared if the first chunk does not apply it. `Hdf5FeatureNotSupported`
+        is raised if chunks have heterogeneous `filter_mask` values and the
+        divergent chunks cannot be inlined.
 
         Parameters
         ----------
         dset : h5py.Dataset
             HDF5 dataset for which to collect storage information.
+        filters: list
+            List of filters applied to the HDF5 dataset. Modified in place
+            if some filters turn out not to be applied.
 
         Returns
         -------
         dict
             HDF5 dataset storage information. Dict keys are chunk array offsets
             as tuples. Dict values are pairs with chunk file offset and size
-            integers.
+            integers, or the chunk's data content when it has to be inlined.
         """
         # Empty (null) dataset...
         if dset.shape is None:
-            return dict()
+            return _NULL_CHUNK_INFO
 
         dsid = dset.id
         if dset.chunks is None:
             # Contiguous dataset...
             if dsid.get_offset() is None:
                 # No data ever written...
-                return dict()
+                return _NULL_CHUNK_INFO
             else:
                 key = (0,) * (len(dset.shape) or 1)
                 return {
@@ -648,16 +717,59 @@ class SingleHdf5ToZarr:
             num_chunks = dsid.get_num_chunks()
             if num_chunks == 0:
                 # No data ever written...
-                return dict()
+                return _NULL_CHUNK_INFO
 
             # Go over all the dataset chunks...
             stinfo = dict()
             chunk_size = dset.chunks
+            filter_mask = None  # type: None | int
 
             def get_key(blob):
                 return tuple([a // b for a, b in zip(blob.chunk_offset, chunk_size)])
 
+            def filter_filters_by_mask(
+                filter_list: List[Codec], filter_mask_
+            ) -> List[Codec]:
+                # [2:] strips the leading "0b"; [::-1] reverses so that bit i
+                # lines up with pipeline filter i; pad with "0" so filters
+                # beyond the highest set bit are kept.
+                bin_mask = bin(filter_mask_)[2:][::-1].ljust(len(filter_list), "0")
+                filters_rest = [
+                    ifilter
+                    for ifilter, imask in zip(filter_list, bin_mask)
+                    if imask == "0"
+                ]
+                return filters_rest
+
             def store_chunk_info(blob):
+                nonlocal filter_mask
+                if filter_mask is None:
+                    filter_mask = blob.filter_mask
+                elif filter_mask != blob.filter_mask:
+                    if blob.size < self.unsupported_inline_threshold:
+                        data_slc = tuple(
+                            slice(dim_offset, min(dim_offset + dim_chunk, dim_bound))
+                            for dim_offset, dim_chunk, dim_bound in zip(
+                                blob.chunk_offset, chunk_size, dset.shape
+                            )
+                        )
+                        data = _read_unsupported_direct(dset, data_slc)
+                        if data.shape != chunk_size:
+                            bg = np.full(
+                                chunk_size, dset.fillvalue, dset.dtype, order="C"
+                            )
+                            bg[tuple(slice(0, d) for d in data.shape)] = data
+                            data_flatten = bg.reshape(-1)
+                        else:
+                            data_flatten = data.reshape(-1)
+                        encoded = _filters_encode_data(
+                            data_flatten, filter_filters_by_mask(filters, filter_mask)
+                        )
+                        stinfo[get_key(blob)] = {"data": encoded}
+                        return
+                    else:
+                        raise Hdf5FeatureNotSupported(
+                            f"Dataset {dset.name} has heterogeneous `filter_mask` - "
+                            f"not supported by kerchunk."
+                        )
                 stinfo[get_key(blob)] = {"offset": blob.byte_offset, "size": blob.size}
 
             has_chunk_iter = callable(getattr(dsid, "chunk_iter", None))
@@ -668,9 +780,31 @@ class SingleHdf5ToZarr:
                 for index in range(num_chunks):
                     store_chunk_info(dsid.get_chunk_info(index))
 
+            # In most cases `filter_mask` is zero, meaning all filters are applied.
+            # If a filter is skipped, the corresponding bit in the mask is set to 1.
+            if filter_mask is not None and filter_mask != 0:
+                filters[:] = filter_filters_by_mask(filters, filter_mask)
             return stinfo
 
 
+def _read_unsupported_direct(dset, slc: Union[slice, Tuple[slice, ...]] = slice(None)):
+    try:
+        return dset[slc]
+    # Reading may raise OSError/ValueError/RuntimeError when a required filter
+    # is not registered (an OSError with an unhelpful message was observed in
+    # testing without `hdf5plugin` imported). Catch all exceptions here, since
+    # the exception is re-raised anyway.
+    except Exception:
+        if hdf5plugin is None:
+            import warnings
+
+            warnings.warn(
+                "Attempt to directly read h5-dataset via `h5py` failed. It is recommended to "
+                "install `hdf5plugin` so that we can register extra filters, and then try again."
+            )
+        raise
+
+
 def _simple_type(x):
     if isinstance(x, bytes):
         return x.decode()
@@ -704,3 +838,12 @@ def _is_netcdf_variable(dataset: h5py.Dataset):
 
 def has_visititems_links():
     return hasattr(h5py.Group, "visititems_links")
+
+
+def _filters_encode_data(data, filters: List[Codec]):
+    for ifilter in filters:
+        data = ifilter.encode(data)
+    return bytes(data)
+
+
+_NULL_CHUNK_INFO = dict()

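A minimal usage sketch of the new resource handling and inlining knob (path
and threshold values are illustrative only):

    from kerchunk.hdf import SingleHdf5ToZarr

    h = SingleHdf5ToZarr(
        "example.h5",                      # hypothetical local file
        inline_threshold=500,
        unsupported_inline_threshold=800,  # inline small chunks kerchunk cannot reference
    )
    try:
        refs = h.translate()
    finally:
        h.close()  # releases the fsspec file handle and h5py.File via the ExitStack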

=====================================
kerchunk/utils.py
=====================================
@@ -217,6 +217,14 @@ def encode_fill_value(v: Any, dtype: np.dtype, compressor: Any = None) -> Any:
     # early out
     if v is None:
         return v
+
+    # Assure that v is a numpy array of the appropriate dtype
+    v = np.asanyarray(v, dtype=dtype)
+    # If v is 1D and only has 1 element, squeeze it to 0-D ndarray.
+    # When v is not 0-D, calling int/float/bool on v will fail
+    # We still need an ndarray in case dtype is a date
+    v = v.squeeze()
+
     if dtype.kind == "V" and dtype.hasobject:
         if compressor is None:
             raise ValueError("missing compressor for object array")

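The added normalization matters because converting an ndarray with ndim > 0
to a Python scalar is deprecated in recent NumPy releases; squeezing a
one-element array down to 0-D keeps the later int/float/bool conversions
valid. A small sketch of the effect:

    import numpy as np

    v = np.asanyarray([9999], dtype=np.dtype("int64"))  # shape (1,)
    v = v.squeeze()                                      # 0-D ndarray
    int(v)                                               # 9999, safe for 0-D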

=====================================
tests/gen_hdf5.py
=====================================
@@ -13,5 +13,8 @@ compressors = dict(
 
 for c in compressors:
     f = h5py.File(f"hdf5_compression_{c}.h5", "w")
-    f.create_dataset("data", data=numpy.arange(100), **compressors[c])
+    # Explicitly set int64 to stay compatible with NumPy 1.x on Windows
+    f.create_dataset(
+        "data", data=numpy.arange(100, dtype=numpy.int64), **compressors[c]
+    )
     f.close()


=====================================
tests/gen_mini_hdf5.py
=====================================
@@ -0,0 +1,71 @@
+import numpy
+import h5py
+import hdf5plugin
+import ujson
+import numcodecs
+import zarr
+from fsspec.implementations.reference import ReferenceFileSystem
+
+
+def _dict_add(a, b):
+    c = a.copy()
+    c.update(b)
+    return c
+
+
+compressors = dict(
+    zstd=hdf5plugin.Zstd(),
+    shuffle_zstd=_dict_add(
+        dict(shuffle=True), hdf5plugin.Zstd()
+    ),  # Test for two filters
+    blosc=hdf5plugin.Blosc(),
+    shuffle_blosc=_dict_add(dict(shuffle=True), hdf5plugin.Blosc()),
+    lz4=hdf5plugin.LZ4(),
+)
+
+for c in compressors:
+    f = h5py.File(f"hdf5_mini_{c}.h5", "w")
+    f.create_dataset("data", (3, 2), dtype=numpy.int32, **compressors[c]).write_direct(
+        numpy.arange(6, dtype=numpy.int32).reshape((3, 2))
+    )
+    f.close()
+
+f = h5py.File(f"hdf5_mali_chunk.h5", "w")
+f.create_dataset("data", (8,), dtype=numpy.int32, chunks=(7,), **hdf5plugin.Zstd())
+f["data"][0:7] = numpy.arange(7, dtype=numpy.int32)
+f["data"].id.write_direct_chunk((7,), numpy.array([7] + [0] * 6, dtype=numpy.int32), 1)
+print(f["data"][:])
+f.close()
+
+f = h5py.File(f"hdf5_mali_chunk2.h5", "w")
+f.create_dataset(
+    "data", (8,), dtype=numpy.int32, chunks=(3,), shuffle=True, **hdf5plugin.Zstd()
+)
+f["data"].id.write_direct_chunk(
+    (0,), numcodecs.Shuffle(4).encode(numpy.array([0, 1, 2], dtype=numpy.int32)), 2
+)
+f["data"].id.write_direct_chunk((3,), numpy.array([3, 4, 5], dtype=numpy.int32), 3)
+f["data"][6:] = numpy.array(
+    [
+        6,
+        7,
+    ],
+    dtype=numpy.int32,
+)
+print(f["data"][:])
+f.close()
+
+# from kerchunk.hdf import SingleHdf5ToZarr
+# for c in compressors:
+#     # if c=='lz4':
+#     #     continue
+#     with open(f'hdf5_mini_{c}.h5', 'rb') as f:
+#         print(ujson.dumps(SingleHdf5ToZarr(f, None, inline_threshold=0, unsupported_inline_threshold=1).translate(), indent=4))
+# with open(f'hdf5_mali_chunk.h5', 'rb') as f:
+#     ref = SingleHdf5ToZarr(f, None).translate()
+#     print(ujson.dumps(ref, indent=4))
+# print(zarr.group(ReferenceFileSystem(ref, target=f'hdf5_mali_chunk.h5').get_mapper())['data'][:])
+# with open(f'hdf5_mali_chunk2.h5', 'rb') as f:
+#     ref = SingleHdf5ToZarr(f, None).translate()
+#     print(ujson.dumps(ref, indent=4))
+# print(zarr.group(ReferenceFileSystem(ref, target=f'hdf5_mali_chunk2.h5').get_mapper())['data'][:])

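For context on the write_direct_chunk calls above: the third argument is the
HDF5 filter mask, in which bit i set to 1 means pipeline filter i was skipped
for that chunk. A tiny sketch (names hypothetical) mirroring how kerchunk
decodes it:

    def applied_filters(filters, mask):
        # LSB corresponds to the first filter in the pipeline
        bits = bin(mask)[2:][::-1].ljust(len(filters), "0")
        return [f for f, b in zip(filters, bits) if b == "0"]

    applied_filters(["shuffle", "zstd"], 2)  # ['shuffle'] - zstd skipped
    applied_filters(["shuffle", "zstd"], 3)  # []          - raw chunk bytes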

=====================================
tests/grib_idx_fixtures/gec00_20250101_00_soilw/gec00_20250101_00_soilw.grib
=====================================
Binary files /dev/null and b/tests/grib_idx_fixtures/gec00_20250101_00_soilw/gec00_20250101_00_soilw.grib differ


=====================================
tests/hdf5_mali_chunk.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mali_chunk.h5 differ


=====================================
tests/hdf5_mali_chunk2.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mali_chunk2.h5 differ


=====================================
tests/hdf5_mini_blosc.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_blosc.h5 differ


=====================================
tests/hdf5_mini_lz4.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_lz4.h5 differ


=====================================
tests/hdf5_mini_shuffle_blosc.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_shuffle_blosc.h5 differ


=====================================
tests/hdf5_mini_shuffle_zstd.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_shuffle_zstd.h5 differ


=====================================
tests/hdf5_mini_zstd.h5
=====================================
Binary files /dev/null and b/tests/hdf5_mini_zstd.h5 differ


=====================================
tests/test_hdf.py
=====================================
@@ -2,7 +2,7 @@ import asyncio
 import fsspec
 import json
 import os.path as osp
-
+import hdf5plugin
 import zarr.core
 import zarr.core.buffer
 import zarr.core.group
@@ -195,7 +195,7 @@ txt = "the change of water into water vapour"
 
 def test_string_embed():
     fn = osp.join(here, "vlen.h5")
-    h = kerchunk.hdf.SingleHdf5ToZarr(fn, fn, vlen_encode="embed", error="pdb")
+    h = kerchunk.hdf.SingleHdf5ToZarr(fn, fn, vlen_encode="embed")
     out = h.translate()
 
     localfs = AsyncFileSystemWrapper(fsspec.filesystem("file"))
@@ -203,7 +203,7 @@ def test_string_embed():
     # assert txt in fs.references["vlen_str/0"]
     store = fs_as_store(fs)
     z = zarr.open(store, zarr_format=2)
-    assert z["vlen_str"].dtype == "O"
+    assert "String" in str(z["vlen_str"])
     assert z["vlen_str"][0] == txt
     assert (z["vlen_str"][1:] == "").all()
 
@@ -218,7 +218,7 @@ def test_string_pathlib():
     fs = fsspec.filesystem("reference", fo=out)
     assert txt in fs.references["vlen_str/0"]
     z = zarr.open(fs_as_store(fs))
-    assert z["vlen_str"].dtype == "O"
+    assert "String" in str(z["vlen_str"])
     assert z["vlen_str"][0] == txt
     assert (z["vlen_str"][1:] == "").all()
 
@@ -299,9 +299,10 @@ def test_compound_string_encode():
     fn = osp.join(here, "vlen2.h5")
     with open(fn, "rb") as f:
         h = kerchunk.hdf.SingleHdf5ToZarr(
-            f, fn, vlen_encode="encode", inline_threshold=0
+            f, fn, vlen_encode="encode", inline_threshold=0, error="raise"
         )
         out = h.translate()
+        print(json.dumps(out, indent=4))
     localfs = AsyncFileSystemWrapper(fsspec.filesystem("file"))
     store = refs_as_store(out, fs=localfs)
     z = zarr.open(store, zarr_format=2)
@@ -328,7 +329,9 @@ def test_compress():
 
     files = glob.glob(osp.join(here, "hdf5_compression_*.h5"))
     for f in files:
-        h = kerchunk.hdf.SingleHdf5ToZarr(f, error="raise")
+        h = kerchunk.hdf.SingleHdf5ToZarr(
+            f, error="raise", unsupported_inline_threshold=800
+        )
         if "compression_lz4" in f or "compression_bitshuffle" in f:
             with pytest.raises(RuntimeError):
                 h.translate()
@@ -339,6 +342,17 @@ def test_compress():
         g = zarr.open(store, zarr_format=2)
         assert np.mean(g["data"]) == 49.5
 
+    for f in files:
+        # Small chunks using unsupported filters can now be inlined
+        h = kerchunk.hdf.SingleHdf5ToZarr(
+            f, error="raise", unsupported_inline_threshold=801
+        )
+        out = h.translate()
+        localfs = AsyncFileSystemWrapper(fsspec.filesystem("file"))
+        store = refs_as_store(out, fs=localfs)
+        g = zarr.open(store, zarr_format=2)
+        assert np.mean(g["data"]) == 49.5
+
 
 # def test_embed():
 #     fn = osp.join(here, "NEONDSTowerTemperatureData.hdf5")
@@ -393,3 +407,44 @@ def test_translate_links():
             for key in z[f"{dset}_{link}"].attrs.keys():
                 if key not in kerchunk.hdf._HIDDEN_ATTRS and key != "_ARRAY_DIMENSIONS":
                     assert z[f"{dset}_{link}"].attrs[key] == z[dset].attrs[key]
+
+
+def test_small_chunks():
+    from fsspec.implementations.reference import ReferenceFileSystem
+
+    suffixes = ["blosc", "shuffle_zstd", "zstd", "lz4", "shuffle_blosc"]
+    for suffix in suffixes:
+        fname = osp.join(here, f"hdf5_mini_{suffix}.h5")
+        with open(fname, "rb") as f:
+            ref = kerchunk.hdf.SingleHdf5ToZarr(f, None).translate()
+        store = ReferenceFileSystem(ref, target=fname, asynchronous=True).get_mapper()
+        data = zarr.group(store)["data"][:]
+        assert (data == np.arange(6, dtype=np.int32).reshape((3, 2))).all()
+
+        if suffix == "lz4":
+            with pytest.raises(RuntimeError):
+                with open(fname, "rb") as f:
+                    ref = kerchunk.hdf.SingleHdf5ToZarr(
+                        f, None, unsupported_inline_threshold=0, error="raise"
+                    ).translate()
+            continue
+        with open(fname, "rb") as f:
+            ref = kerchunk.hdf.SingleHdf5ToZarr(
+                f, None, unsupported_inline_threshold=0
+            ).translate()
+        store2 = ReferenceFileSystem(ref, target=fname).get_mapper()
+        data2 = zarr.group(store2)["data"][:]
+        assert (data2 == np.arange(6, dtype=np.int32).reshape((3, 2))).all()
+
+
+def test_malicious_chunks():
+    from fsspec.implementations.reference import ReferenceFileSystem
+
+    suffixes = ["", "2"]
+    for suffix in suffixes:
+        fname = osp.join(here, f"hdf5_mali_chunk{suffix}.h5")
+        with open(fname, "rb") as f:
+            ref = kerchunk.hdf.SingleHdf5ToZarr(f, None).translate()
+        store = ReferenceFileSystem(ref, target=fname, asynchronous=True).get_mapper()
+        data = zarr.group(store)["data"][:]
+        assert (data == np.arange(8, dtype=np.int32)).all()


=====================================
tests/test_utils.py
=====================================
@@ -176,3 +176,10 @@ def test_deflate_zip_archive(m):
 
     fs = fsspec.filesystem("reference", fo=refs2)
     assert dec.decode(fs.cat("b")) == data
+
+
+def test_encode_fill_value():
+    assert kerchunk.utils.encode_fill_value(np.array([9999]), np.dtype("int")) == 9999
+    assert kerchunk.utils.encode_fill_value(np.array(9999), np.dtype("int")) == 9999
+    assert kerchunk.utils.encode_fill_value([9999], np.dtype("int")) == 9999
+    assert kerchunk.utils.encode_fill_value(9999, np.dtype("int")) == 9999



View it on GitLab: https://salsa.debian.org/debian-gis-team/kerchunk/-/commit/6aa2e9382c1fcfccd7cb6af8d4b9f44d0145abb9

-- 
You're receiving this email because of your account on salsa.debian.org.

