[Git][debian-gis-team/pyogrio][master] 7 commits: Update Files-Excluded.
Bas Couwenberg (@sebastic)
gitlab at salsa.debian.org
Wed Nov 26 13:33:54 GMT 2025
Bas Couwenberg pushed to branch master at Debian GIS Project / pyogrio
Commits:
55f553c3 by Bas Couwenberg at 2025-11-26T14:21:45+01:00
Update Files-Excluded.
- - - - -
b26bb6a8 by Bas Couwenberg at 2025-11-26T14:21:48+01:00
New upstream version 0.12.0+ds
- - - - -
668e021f by Bas Couwenberg at 2025-11-26T14:21:49+01:00
Update upstream source from tag 'upstream/0.12.0+ds'
Update to upstream version '0.12.0+ds'
with Debian dir 935452d8048d09865afc300c6f33a37d3bb61e50
- - - - -
d7f4a05e by Bas Couwenberg at 2025-11-26T14:22:11+01:00
New upstream release.
- - - - -
f44a8ae3 by Bas Couwenberg at 2025-11-26T14:24:02+01:00
Drop spelling-errors.patch, applied upstream.
- - - - -
1eb9fc7f by Bas Couwenberg at 2025-11-26T14:27:29+01:00
Make pytest output extra verbose.
- - - - -
d862e405 by Bas Couwenberg at 2025-11-26T14:27:42+01:00
Set distribution to unstable.
- - - - -
29 changed files:
- CHANGES.md
- README.md
- debian/changelog
- debian/copyright
- − debian/patches/series
- − debian/patches/spelling-errors.patch
- debian/rules
- docs/environment.yml
- docs/source/install.md
- docs/source/introduction.md
- pyogrio/_compat.py
- pyogrio/_err.pyx
- pyogrio/_io.pyx
- pyogrio/_ogr.pxd
- pyogrio/_ogr.pyx
- pyogrio/_version.py
- pyogrio/core.py
- pyogrio/geopandas.py
- pyogrio/raw.py
- pyogrio/tests/conftest.py
- + pyogrio/tests/fixtures/list_field_values_file.parquet
- + pyogrio/tests/fixtures/list_nested_struct_file.parquet
- pyogrio/tests/test_arrow.py
- pyogrio/tests/test_core.py
- pyogrio/tests/test_geopandas_io.py
- pyogrio/tests/test_raw_io.py
- pyogrio/util.py
- pyproject.toml
- setup.py
Changes:
=====================================
CHANGES.md
=====================================
@@ -1,5 +1,41 @@
# CHANGELOG
+## 0.12.0 (2025-11-26)
+
+### Potentially breaking changes
+
+- Return JSON fields (as identified by GDAL) as dicts/lists in `read_dataframe`;
+ these were previously returned as strings (#556).
+- Drop support for GDAL 3.4 and 3.5 (#584).
+
+### Improvements
+
+- Add `datetime_as_string` and `mixed_offsets_as_utc` parameters to `read_dataframe`
+ to choose the way datetime columns are returned + several fixes when reading and
+ writing datetimes (#486).
+- Add listing of GDAL data types and subtypes to `read_info` (#556).
+- Add support to read list fields without arrow (#558, #597).
+
+### Bug fixes
+
+- Fix decode error reading an sqlite file on Windows (#568).
+- Fix wrong layer name when creating .gpkg.zip file (#570).
+- Fix segfault on providing an invalid value for `layer` in `read_info` (#564).
+- Fix error when reading data with ``use_arrow=True`` after having used the
+ Parquet driver with GDAL>=3.12 (#601).
+
+### Packaging
+
+- Wheels are now available for Python 3.14 (#579).
+- The GDAL library included in the wheels is upgraded from 3.10.3 to 3.11.4 (#578).
+- Add libkml driver to the wheels for more recent Linux platforms supported
+ by manylinux_2_28, macOS, and Windows (#561).
+- Add libspatialite to the wheels (#546).
+- Minimum required Python version is now 3.10 (#557).
+- Initial support for free-threaded Python builds, with the extension module
+ declaring free-threaded support and wheels for Python 3.13t and 3.14t being
+ built (#562).
+
## 0.11.1 (2025-08-02)
### Bug fixes
@@ -150,7 +186,7 @@
### Improvements
-- Support reading and writing datetimes with timezones (#253).
+- Support reading and writing datetimes with time zones (#253).
- Support writing dataframes without geometry column (#267).
- Calculate feature count by iterating over features if GDAL returns an
unknown count for a data layer (e.g., OSM driver); this may have significant
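
As a quick illustration of the `read_info` additions listed above, a minimal sketch (the dataset path is a placeholder, not part of the upstream changelog):

```python
# Minimal sketch of the new "ogr_types"/"ogr_subtypes" entries that 0.12.0
# adds to read_info(); "example.gpkg" is a hypothetical dataset path.
from pyogrio import read_info

info = read_info("example.gpkg")
for name, dtype, ogr_type, ogr_subtype in zip(
    info["fields"], info["dtypes"], info["ogr_types"], info["ogr_subtypes"]
):
    print(name, dtype, ogr_type, ogr_subtype)  # e.g. "name object OFTString OFSTNone"
```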
=====================================
README.md
=====================================
@@ -26,7 +26,7 @@ Read the documentation for more information:
## Requirements
-Supports Python 3.9 - 3.13 and GDAL 3.4.x - 3.9.x.
+Supports Python 3.10 - 3.14 and GDAL 3.6.x - 3.11.x.
Reading to GeoDataFrames requires `geopandas>=0.12` with `shapely>=2`.
=====================================
debian/changelog
=====================================
@@ -1,11 +1,14 @@
-pyogrio (0.11.1+ds-2) UNRELEASED; urgency=medium
+pyogrio (0.12.0+ds-1) unstable; urgency=medium
* Team upload.
+ * New upstream release.
* Update lintian overrides.
* Drop Rules-Requires-Root: no, default since dpkg 1.22.13.
* Use test-build-validate-cleanup instead of test-build-twice.
+ * Drop spelling-errors.patch, applied upstream.
+ * Make pytest output extra verbose.
- -- Bas Couwenberg <sebastic at debian.org> Fri, 12 Sep 2025 17:45:39 +0200
+ -- Bas Couwenberg <sebastic at debian.org> Wed, 26 Nov 2025 14:27:30 +0100
pyogrio (0.11.1+ds-1) unstable; urgency=medium
=====================================
debian/copyright
=====================================
@@ -6,8 +6,6 @@ Comment: The upstream release is repacked in order to exclude
unneeded CI files and binary data.
Files-Excluded: .github/workflows
ci
- pyogrio/tests/fixtures/poly_not_enough_points.shp.zip
- pyogrio/tests/fixtures/test_fgdb.gdb.zip
Files: *
Copyright: 2020-2024, Brendan C. Ward and pyogrio contributors
=====================================
debian/patches/series deleted
=====================================
@@ -1 +0,0 @@
-spelling-errors.patch
=====================================
debian/patches/spelling-errors.patch deleted
=====================================
@@ -1,17 +0,0 @@
-Description: Fix spelling errors:
- * occuring -> occurring
-Author: Bas Couwenberg <sebastic at debian.org>
-Forwarded: https://github.com/geopandas/pyogrio/pull/554
-Applied-Upstream: https://github.com/geopandas/pyogrio/commit/4ca3db0bc099f22bbae4c12c43951cd87415a1f6
-
---- a/pyogrio/_err.pyx
-+++ b/pyogrio/_err.pyx
-@@ -418,7 +418,7 @@ cdef void stacking_error_handler(
-
- @contextlib.contextmanager
- def capture_errors():
-- """A context manager that captures all GDAL non-fatal errors occuring.
-+ """A context manager that captures all GDAL non-fatal errors occurring.
-
- It adds all errors to a single stack, so it assumes that no more than one
- GDAL function is called.
=====================================
debian/rules
=====================================
@@ -1,8 +1,9 @@
#!/usr/bin/make -f
export DEB_BUILD_MAINT_OPTIONS=hardening=+all
+
export PYBUILD_NAME=pyogrio
-export PYBUILD_TEST_ARGS=\
+export PYBUILD_TEST_ARGS=-vv \
-k "not test_url \
and not test_url_dataframe \
and not test_url_with_zip \
=====================================
docs/environment.yml
=====================================
@@ -2,9 +2,9 @@ name: pyogrio
channels:
- conda-forge
dependencies:
- - python==3.9.*
+ - python==3.10.*
- gdal
- - numpy==1.19.*
+ - numpy==1.24.*
- numpydoc==1.1.*
- Cython==0.29.*
- docutils==0.16.*
=====================================
docs/source/install.md
=====================================
@@ -2,7 +2,7 @@
## Requirements
-Supports Python 3.9 - 3.13 and GDAL 3.4.x - 3.9.x
+Supports Python 3.10 - 3.14 and GDAL 3.6.x - 3.11.x
Reading to GeoDataFrames requires `geopandas>=0.12` with `shapely>=2`.
@@ -132,20 +132,20 @@ To build on Windows, you need to provide additional environment variables or
command-line parameters because the location of the GDAL binaries and headers
cannot be automatically determined.
-Assuming GDAL 3.4.1 is installed to `c:\GDAL`, you can set the `GDAL_INCLUDE_PATH`,
+Assuming GDAL 3.8.3 is installed to `c:\GDAL`, you can set the `GDAL_INCLUDE_PATH`,
`GDAL_LIBRARY_PATH` and `GDAL_VERSION` environment variables and build as follows:
```bash
set GDAL_INCLUDE_PATH=C:\GDAL\include
set GDAL_LIBRARY_PATH=C:\GDAL\lib
-set GDAL_VERSION=3.4.1
+set GDAL_VERSION=3.8.3
python -m pip install --no-deps --force-reinstall --no-use-pep517 -e . -v
```
Alternatively, you can pass those options also as command-line parameters:
```bash
-python -m pip install --install-option=build_ext --install-option="-IC:\GDAL\include" --install-option="-lgdal_i" --install-option="-LC:\GDAL\lib" --install-option="--gdalversion=3.4.1" --no-deps --force-reinstall --no-use-pep517 -e . -v
+python -m pip install --install-option=build_ext --install-option="-IC:\GDAL\include" --install-option="-lgdal_i" --install-option="-LC:\GDAL\lib" --install-option="--gdalversion=3.8.3" --no-deps --force-reinstall --no-use-pep517 -e . -v
```
The location of the GDAL DLLs must be on your system `PATH`.
=====================================
docs/source/introduction.md
=====================================
@@ -481,13 +481,17 @@ Not all file formats have dedicated support to store datetime data, like ESRI
Shapefile. For such formats, or if you require precision > ms, a workaround is to
convert the datetimes to string.
-Timezone information is preserved where possible, however GDAL only represents
-time zones as UTC offsets, whilst pandas uses IANA time zones (via `pytz` or
-`zoneinfo`). This means that dataframes with columns containing multiple offsets
-(e.g. when switching from standard time to summer time) will be written correctly,
-but when read via `pyogrio.read_dataframe()` will be returned as a UTC datetime
-column, as there is no way to reconstruct the original timezone from the individual
-offsets present.
+When you have datetime columns with time zone information, it is important to
+note that GDAL only represents time zones as UTC offsets, whilst pandas uses
+IANA time zones (via `pytz` or `zoneinfo`). As a result, even if a column in a
+DataFrame contains datetimes in a single time zone, this will often still result
+in mixed time zone offsets being written for time zones where daylight saving
+time is used (e.g. +01:00 and +02:00 offsets for time zone Europe/Brussels).
+When roundtripping through GDAL, the information about the original time zone
+is lost; only the offsets can be preserved. By default,
+{func}`pyogrio.read_dataframe()` will convert columns with mixed offsets to UTC
+to return a datetime64 column. If you want to preserve the original offsets,
+you can use `datetime_as_string=True` or `mixed_offsets_as_utc=False`.
## Dataset and layer creation options
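
A minimal sketch of the reading behaviour described above, assuming a placeholder dataset with a mixed-offset datetime column:

```python
# Hedged sketch of the datetime options described above; "example.gpkg" is a
# hypothetical dataset containing a datetime column with mixed UTC offsets.
import pyogrio

# Default: mixed-offset columns are converted to UTC (datetime64 dtype).
df_utc = pyogrio.read_dataframe("example.gpkg")

# Preserve the individual offsets; the column becomes object dtype holding
# python datetime values with fixed offsets.
df_offsets = pyogrio.read_dataframe("example.gpkg", mixed_offsets_as_utc=False)

# Skip parsing entirely and return ISO8601 strings.
df_strings = pyogrio.read_dataframe("example.gpkg", datetime_as_string=True)
```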
=====================================
pyogrio/_compat.py
=====================================
@@ -29,7 +29,6 @@ except ImportError:
pandas = None
-HAS_ARROW_API = __gdal_version__ >= (3, 6, 0)
HAS_ARROW_WRITE_API = __gdal_version__ >= (3, 8, 0)
HAS_PYARROW = pyarrow is not None
HAS_PYPROJ = pyproj is not None
@@ -42,9 +41,9 @@ HAS_GEOPANDAS = geopandas is not None
PANDAS_GE_15 = pandas is not None and Version(pandas.__version__) >= Version("1.5.0")
PANDAS_GE_20 = pandas is not None and Version(pandas.__version__) >= Version("2.0.0")
PANDAS_GE_22 = pandas is not None and Version(pandas.__version__) >= Version("2.2.0")
+PANDAS_GE_23 = pandas is not None and Version(pandas.__version__) >= Version("2.3.0")
PANDAS_GE_30 = pandas is not None and Version(pandas.__version__) >= Version("3.0.0dev")
-GDAL_GE_352 = __gdal_version__ >= (3, 5, 2)
GDAL_GE_37 = __gdal_version__ >= (3, 7, 0)
GDAL_GE_38 = __gdal_version__ >= (3, 8, 0)
GDAL_GE_311 = __gdal_version__ >= (3, 11, 0)
=====================================
pyogrio/_err.pyx
=====================================
@@ -418,7 +418,7 @@ cdef void stacking_error_handler(
@contextlib.contextmanager
def capture_errors():
- """A context manager that captures all GDAL non-fatal errors occuring.
+ """A context manager that captures all GDAL non-fatal errors occurring.
It adds all errors to a single stack, so it assumes that no more than one
GDAL function is called.
=====================================
pyogrio/_io.pyx
=====================================
@@ -24,11 +24,9 @@ from cpython.pycapsule cimport PyCapsule_New, PyCapsule_GetPointer
import numpy as np
-from pyogrio._ogr cimport *
from pyogrio._err cimport (
check_last_error, check_int, check_pointer, ErrorHandler
)
-from pyogrio._vsi cimport *
from pyogrio._err import (
CPLE_AppDefinedError,
CPLE_BaseError,
@@ -38,6 +36,9 @@ from pyogrio._err import (
capture_errors,
)
from pyogrio._geometry cimport get_geometry_type, get_geometry_type_code
+from pyogrio._ogr cimport *
+from pyogrio._ogr import MULTI_EXTENSIONS
+from pyogrio._vsi cimport *
from pyogrio.errors import (
CRSError, DataSourceError, DataLayerError, GeometryError, FieldError, FeatureError
)
@@ -49,11 +50,11 @@ log = logging.getLogger(__name__)
# (index in array is the integer field type)
FIELD_TYPES = [
"int32", # OFTInteger, Simple 32bit integer
- None, # OFTIntegerList, List of 32bit integers, not supported
+ "list(int32)", # OFTIntegerList, List of 32bit integers
"float64", # OFTReal, Double Precision floating point
- None, # OFTRealList, List of doubles, not supported
+ "list(float64)", # OFTRealList, List of doubles
"object", # OFTString, String of UTF-8 chars
- None, # OFTStringList, Array of strings, not supported
+ "list(str)", # OFTStringList, Array of strings
None, # OFTWideString, deprecated, not supported
None, # OFTWideStringList, deprecated, not supported
"object", # OFTBinary, Raw Binary data
@@ -61,9 +62,28 @@ FIELD_TYPES = [
None, # OFTTime, Time, NOTE: not directly supported in numpy
"datetime64[ms]", # OFTDateTime, Date and Time
"int64", # OFTInteger64, Single 64bit integer
- None # OFTInteger64List, List of 64bit integers, not supported
+ "list(int64)" # OFTInteger64List, List of 64bit integers, not supported
]
+# Mapping of OGR integer field types to OGR type names
+# (index in array is the integer field type)
+FIELD_TYPE_NAMES = {
+ OFTInteger: "OFTInteger", # Simple 32bit integer
+ OFTIntegerList: "OFTIntegerList", # List of 32bit integers
+ OFTReal: "OFTReal", # Double Precision floating point
+ OFTRealList: "OFTRealList", # List of doubles
+ OFTString: "OFTString", # String of UTF-8 chars
+ OFTStringList: "OFTStringList", # Array of strings
+ OFTWideString: "OFTWideString", # deprecated, not supported
+ OFTWideStringList: "OFTWideStringList", # deprecated, not supported
+ OFTBinary: "OFTBinary", # Raw Binary data
+ OFTDate: "OFTDate", # Date
+ OFTTime: "OFTTime", # Time: not directly supported in numpy
+ OFTDateTime: "OFTDateTime", # Date and Time
+ OFTInteger64: "OFTInteger64", # Single 64bit integer
+ OFTInteger64List: "OFTInteger64List", # List of 64bit integers
+}
+
FIELD_SUBTYPES = {
OFSTNone: None, # No subtype
OFSTBoolean: "bool", # Boolean integer
@@ -71,6 +91,16 @@ FIELD_SUBTYPES = {
OFSTFloat32: "float32", # Single precision (32 bit) floating point
}
+FIELD_SUBTYPE_NAMES = {
+ OFSTNone: "OFSTNone", # No subtype
+ OFSTBoolean: "OFSTBoolean", # Boolean integer
+ OFSTInt16: "OFSTInt16", # Signed 16-bit integer
+ OFSTFloat32: "OFSTFloat32", # Single precision (32 bit) floating point
+ OFSTJSON: "OFSTJSON",
+ OFSTUUID: "OFSTUUID",
+ OFSTMaxSubType: "OFSTMaxSubType",
+}
+
# Mapping of numpy ndarray dtypes to (field type, subtype)
DTYPE_OGR_FIELD_TYPES = {
"int8": (OFTInteger, OFSTInt16),
@@ -274,6 +304,10 @@ cdef OGRLayerH get_ogr_layer(GDALDatasetH ogr_dataset, layer) except NULL:
elif isinstance(layer, int):
ogr_layer = check_pointer(GDALDatasetGetLayer(ogr_dataset, layer))
+ else:
+ raise ValueError(
+ f"'layer' parameter must be a str or int, got {type(layer)}"
+ )
# GDAL does not always raise exception messages in this case
except NullPointerError:
@@ -610,6 +644,11 @@ cdef detect_encoding(OGRDataSourceH ogr_dataset, OGRLayerH ogr_layer):
# In old gdal versions, OLCStringsAsUTF8 wasn't advertised yet.
return "UTF-8"
+ if driver == "SQLite":
+ # TestCapability for OLCStringsAsUTF8 returns False for SQLite in GDAL 3.11.3.
+ # Issue opened: https://github.com/OSGeo/gdal/issues/12962
+ return "UTF-8"
+
return locale.getpreferredencoding()
@@ -627,8 +666,8 @@ cdef get_fields(OGRLayerH ogr_layer, str encoding, use_arrow=False):
Returns
-------
- ndarray(n, 4)
- array of index, ogr type, name, numpy type
+ ndarray(n, 5)
+ array of index, ogr type, name, numpy type, ogr subtype
"""
cdef int i
cdef int field_count
@@ -648,7 +687,7 @@ cdef get_fields(OGRLayerH ogr_layer, str encoding, use_arrow=False):
field_count = OGR_FD_GetFieldCount(ogr_featuredef)
- fields = np.empty(shape=(field_count, 4), dtype=object)
+ fields = np.empty(shape=(field_count, 5), dtype=object)
fields_view = fields[:, :]
skipped_fields = False
@@ -685,6 +724,7 @@ cdef get_fields(OGRLayerH ogr_layer, str encoding, use_arrow=False):
fields_view[i, 1] = field_type
fields_view[i, 2] = field_name
fields_view[i, 3] = np_type
+ fields_view[i, 4] = field_subtype
if skipped_fields:
# filter out skipped fields
@@ -879,6 +919,10 @@ cdef process_fields(
cdef int success
cdef int field_index
cdef int ret_length
+ cdef int *ints_c
+ cdef GIntBig *int64s_c
+ cdef double *doubles_c
+ cdef char **strings_c
cdef GByte *bin_value
cdef int year = 0
cdef int month = 0
@@ -936,10 +980,16 @@ cdef process_fields(
if datetime_as_string:
# defer datetime parsing to user/ pandas layer
- # Update to OGR_F_GetFieldAsISO8601DateTime when GDAL 3.7+ only
- data[i] = get_string(
- OGR_F_GetFieldAsString(ogr_feature, field_index), encoding=encoding
- )
+ IF CTE_GDAL_VERSION >= (3, 7, 0):
+ data[i] = get_string(
+ OGR_F_GetFieldAsISO8601DateTime(ogr_feature, field_index, NULL),
+ encoding=encoding,
+ )
+ ELSE:
+ data[i] = get_string(
+ OGR_F_GetFieldAsString(ogr_feature, field_index),
+ encoding=encoding,
+ )
else:
success = OGR_F_GetFieldAsDateTimeEx(
ogr_feature,
@@ -969,6 +1019,55 @@ cdef process_fields(
year, month, day, hour, minute, second, microsecond
).isoformat()
+ elif field_type == OFTIntegerList:
+ # According to GDAL doc, this can return NULL for an empty list, which is a
+ # valid result. So don't use check_pointer as it would throw an exception.
+ ints_c = OGR_F_GetFieldAsIntegerList(ogr_feature, field_index, &ret_length)
+ int_arr = np.ndarray(shape=(ret_length,), dtype=np.int32)
+ for j in range(ret_length):
+ int_arr[j] = ints_c[j]
+ data[i] = int_arr
+
+ elif field_type == OFTInteger64List:
+ # According to GDAL doc, this can return NULL for an empty list, which is a
+ # valid result. So don't use check_pointer as it would throw an exception.
+ int64s_c = OGR_F_GetFieldAsInteger64List(
+ ogr_feature, field_index, &ret_length
+ )
+
+ int_arr = np.ndarray(shape=(ret_length,), dtype=np.int64)
+ for j in range(ret_length):
+ int_arr[j] = int64s_c[j]
+ data[i] = int_arr
+
+ elif field_type == OFTRealList:
+ # According to GDAL doc, this can return NULL for an empty list, which is a
+ # valid result. So don't use check_pointer as it would throw an exception.
+ doubles_c = OGR_F_GetFieldAsDoubleList(
+ ogr_feature, field_index, &ret_length
+ )
+
+ double_arr = np.ndarray(shape=(ret_length,), dtype=np.float64)
+ for j in range(ret_length):
+ double_arr[j] = doubles_c[j]
+ data[i] = double_arr
+
+ elif field_type == OFTStringList:
+ # According to GDAL doc, this can return NULL for an empty list, which is a
+ # valid result. So don't use check_pointer as it would throw an exception.
+ strings_c = OGR_F_GetFieldAsStringList(ogr_feature, field_index)
+
+ string_list_index = 0
+ vals = []
+ if strings_c != NULL:
+ # According to GDAL doc, the list is terminated by a NULL pointer.
+ while strings_c[string_list_index] != NULL:
+ val = strings_c[string_list_index]
+ vals.append(get_string(val, encoding=encoding))
+ string_list_index += 1
+
+ data[i] = np.array(vals)
+
@cython.boundscheck(False) # Deactivate bounds checking
@cython.wraparound(False) # Deactivate negative indexing.
@@ -1012,16 +1111,16 @@ cdef get_features(
field_indexes = fields[:, 0]
field_ogr_types = fields[:, 1]
- field_data = [
- np.empty(
- shape=(num_features, ),
- dtype = (
- "object"
- if datetime_as_string and fields[field_index, 3].startswith("datetime")
- else fields[field_index, 3]
- )
- ) for field_index in range(n_fields)
- ]
+ field_data = []
+ for field_index in range(n_fields):
+ if datetime_as_string and fields[field_index, 3].startswith("datetime"):
+ dtype = "object"
+ elif fields[field_index, 3].startswith("list"):
+ dtype = "object"
+ else:
+ dtype = fields[field_index, 3]
+
+ field_data.append(np.empty(shape=(num_features, ), dtype=dtype))
field_data_view = [field_data[field_index][:] for field_index in range(n_fields)]
@@ -1413,11 +1512,18 @@ def ogr_read(
datetime_as_string=datetime_as_string
)
+ ogr_types = [FIELD_TYPE_NAMES.get(field[1], "Unknown") for field in fields]
+ ogr_subtypes = [
+ FIELD_SUBTYPE_NAMES.get(field[4], "Unknown") for field in fields
+ ]
+
meta = {
"crs": crs,
"encoding": encoding,
"fields": fields[:, 2],
"dtypes": fields[:, 3],
+ "ogr_types": ogr_types,
+ "ogr_subtypes": ogr_subtypes,
"geometry_type": geometry_type,
}
@@ -1502,6 +1608,7 @@ def ogr_open_arrow(
int return_fids=False,
int batch_size=0,
use_pyarrow=False,
+ datetime_as_string=False,
):
cdef int err = 0
@@ -1520,9 +1627,6 @@ def ogr_open_arrow(
cdef ArrowArrayStream* stream
cdef ArrowSchema schema
- IF CTE_GDAL_VERSION < (3, 6, 0):
- raise RuntimeError("Need GDAL>=3.6 for Arrow support")
-
if force_2d:
raise ValueError("forcing 2D is not supported for Arrow")
@@ -1722,6 +1826,12 @@ def ogr_open_arrow(
"GEOARROW".encode("UTF-8")
)
+ # Read DateTime fields as strings, as the Arrow DateTime column type is
+ # quite limited regarding support for mixed time zones.
+ IF CTE_GDAL_VERSION >= (3, 11, 0):
+ if datetime_as_string:
+ options = CSLSetNameValue(options, "DATETIME_AS_STRING", "YES")
+
# make sure layer is read from beginning
OGR_L_ResetReading(ogr_layer)
@@ -1745,10 +1855,18 @@ def ogr_open_arrow(
else:
reader = _ArrowStream(capsule)
+ ogr_types = [FIELD_TYPE_NAMES.get(field[1], "Unknown") for field in fields]
+ ogr_subtypes = [
+ FIELD_SUBTYPE_NAMES.get(field[4], "Unknown") for field in fields
+ ]
+
meta = {
"crs": crs,
"encoding": encoding,
"fields": fields[:, 2],
+ "dtypes": fields[:, 3],
+ "ogr_types": ogr_types,
+ "ogr_subtypes": ogr_subtypes,
"geometry_type": geometry_type,
"geometry_name": geometry_name,
"fid_column": fid_column,
@@ -1905,6 +2023,10 @@ def ogr_read_info(
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)
fields = get_fields(ogr_layer, encoding)
+ ogr_types = [FIELD_TYPE_NAMES.get(field[1], "Unknown") for field in fields]
+ ogr_subtypes = [
+ FIELD_SUBTYPE_NAMES.get(field[4], "Unknown") for field in fields
+ ]
meta = {
"layer_name": get_string(OGR_L_GetName(ogr_layer)),
@@ -1912,6 +2034,8 @@ def ogr_read_info(
"encoding": encoding,
"fields": fields[:, 2],
"dtypes": fields[:, 3],
+ "ogr_types": ogr_types,
+ "ogr_subtypes": ogr_subtypes,
"fid_column": get_string(OGR_L_GetFIDColumn(ogr_layer)),
"geometry_name": get_string(OGR_L_GetGeometryColumn(ogr_layer)),
"geometry_type": get_geometry_type(ogr_layer),
@@ -2236,7 +2360,15 @@ cdef create_ogr_dataset_layer(
path_exists = os.path.exists(path) if not use_tmp_vsimem else False
if not layer:
- layer = os.path.splitext(os.path.split(path)[1])[0]
+ # For multi extensions (e.g. ".shp.zip"), strip the full extension
+ for multi_ext in MULTI_EXTENSIONS:
+ if path.endswith(multi_ext):
+ layer = os.path.split(path)[1][:-len(multi_ext)]
+ break
+
+ # If it wasn't a multi-extension, use the file stem
+ if not layer:
+ layer = os.path.splitext(os.path.split(path)[1])[0]
# if shapefile, GeoJSON, or FlatGeobuf, always delete first
# for other types, check if we can create layers
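
To illustrate the list-field support added above, a hedged sketch (file and column names are placeholders modelled on the test fixtures; the exact reprs may differ):

```python
# Hedged sketch: with the FIELD_TYPES changes above, OGR list fields are read
# without Arrow as object columns holding one numpy array per feature.
from pyogrio import read_dataframe

df = read_dataframe("list_fields.geojson", use_arrow=False)
print(repr(df["list_int"].iloc[0]))     # e.g. array([0, 1], dtype=int32)
print(repr(df["list_string"].iloc[0]))  # e.g. array(['string1', 'string2'], dtype='<U7')
```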
=====================================
pyogrio/_ogr.pxd
=====================================
@@ -185,6 +185,9 @@ cdef extern from "ogr_core.h":
OFSTBoolean
OFSTInt16
OFSTFloat32
+ OFSTJSON
+ OFSTUUID
+ OFSTMaxSubType
ctypedef void* OGRDataSourceH
ctypedef void* OGRFeatureDefnH
@@ -256,6 +259,7 @@ cdef extern from "arrow_bridge.h" nogil:
cdef extern from "ogr_api.h":
+ ctypedef signed long long GIntBig
int OGRGetDriverCount()
OGRSFDriverH OGRGetDriver(int)
@@ -283,6 +287,14 @@ cdef extern from "ogr_api.h":
int OGR_F_GetFieldAsInteger(OGRFeatureH feature, int n)
int64_t OGR_F_GetFieldAsInteger64(OGRFeatureH feature, int n)
const char* OGR_F_GetFieldAsString(OGRFeatureH feature, int n)
+ char ** OGR_F_GetFieldAsStringList(OGRFeatureH feature, int n)
+ const int * OGR_F_GetFieldAsIntegerList(
+ OGRFeatureH feature, int n, int* pnCount)
+ const GIntBig * OGR_F_GetFieldAsInteger64List(
+ OGRFeatureH feature, int n, int* pnCount)
+ const double * OGR_F_GetFieldAsDoubleList(
+ OGRFeatureH feature, int n, int* pnCount)
+
int OGR_F_IsFieldSetAndNotNull(OGRFeatureH feature, int n)
void OGR_F_SetFieldDateTime(OGRFeatureH feature,
@@ -406,12 +418,16 @@ cdef extern from "ogr_api.h":
const char* OLCFastGetExtent
const char* OLCTransactions
+cdef extern from "ogr_api.h":
+ bint OGR_L_GetArrowStream(
+ OGRLayerH hLayer, ArrowArrayStream *out_stream, char** papszOptions
+ )
-IF CTE_GDAL_VERSION >= (3, 6, 0):
+IF CTE_GDAL_VERSION >= (3, 7, 0):
cdef extern from "ogr_api.h":
- bint OGR_L_GetArrowStream(
- OGRLayerH hLayer, ArrowArrayStream *out_stream, char** papszOptions
+ const char* OGR_F_GetFieldAsISO8601DateTime(
+ OGRFeatureH feature, int n, char** papszOptions
)
=====================================
pyogrio/_ogr.pyx
=====================================
@@ -1,12 +1,13 @@
import os
import sys
-from uuid import uuid4
import warnings
from pyogrio._err cimport check_pointer
from pyogrio._err import CPLE_BaseError, NullPointerError
from pyogrio.errors import DataSourceError
+MULTI_EXTENSIONS = (".gpkg.zip", ".shp.zip")
+
cdef get_string(const char *c_str, str encoding="UTF-8"):
"""Get Python string from a char *.
@@ -42,21 +43,16 @@ def get_gdal_version_string():
return get_string(version)
-IF CTE_GDAL_VERSION >= (3, 4, 0):
-
- cdef extern from "ogr_api.h":
- bint OGRGetGEOSVersion(int *pnMajor, int *pnMinor, int *pnPatch)
+cdef extern from "ogr_api.h":
+ bint OGRGetGEOSVersion(int *pnMajor, int *pnMinor, int *pnPatch)
def get_gdal_geos_version():
cdef int major, minor, revision
- IF CTE_GDAL_VERSION >= (3, 4, 0):
- if not OGRGetGEOSVersion(&major, &minor, &revision):
- return None
- return (major, minor, revision)
- ELSE:
+ if not OGRGetGEOSVersion(&major, &minor, &revision):
return None
+ return (major, minor, revision)
def set_gdal_config_options(dict options):
@@ -165,7 +161,7 @@ def get_gdal_data_path():
"""
cdef const char *path_c = CPLFindFile("gdal", "header.dxf")
if path_c != NULL:
- return get_string(path_c).rstrip("header.dxf")
+ return get_string(path_c).replace("header.dxf", "")
return None
@@ -336,10 +332,10 @@ def _get_drivers_for_path(path):
# allow specific drivers to have a .zip extension to match GDAL behavior
if ext == "zip":
- if path.endswith(".shp.zip"):
- ext = "shp.zip"
- elif path.endswith(".gpkg.zip"):
- ext = "gpkg.zip"
+ for multi_ext in MULTI_EXTENSIONS:
+ if path.endswith(multi_ext):
+ ext = multi_ext[1:] # strip leading dot
+ break
drivers = []
for i in range(OGRGetDriverCount()):
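
The new MULTI_EXTENSIONS tuple centralises the ".gpkg.zip"/".shp.zip" handling used for driver lookup and default layer names; a stand-alone sketch of that resolution logic (an assumption mirroring the Cython code above, not a copy of it):

```python
# Stand-alone sketch of the multi-extension resolution; assumes it mirrors the
# Cython logic added above.
import os

MULTI_EXTENSIONS = (".gpkg.zip", ".shp.zip")

def resolve(path):
    filename = os.path.split(path)[1]
    for multi_ext in MULTI_EXTENSIONS:
        if path.endswith(multi_ext):
            # strip the leading dot for the extension, the full multi-extension
            # for the default layer name
            return multi_ext[1:], filename[: -len(multi_ext)]
    root, ext = os.path.splitext(filename)
    return ext.lstrip("."), root

print(resolve("/data/roads.gpkg.zip"))  # ('gpkg.zip', 'roads')
print(resolve("/data/roads.shp"))       # ('shp', 'roads')
```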
=====================================
pyogrio/_version.py
=====================================
@@ -25,9 +25,9 @@ def get_keywords():
# setup.py/versioneer.py will grep for the variable names, so they must
# each be defined on a line of their own. _version.py will just call
# get_keywords().
- git_refnames = " (HEAD -> main, tag: v0.11.1)"
- git_full = "d3ff55ba80ea5f1744d40f7502adec3658d91b15"
- git_date = "2025-08-02 21:41:37 +0200"
+ git_refnames = " (HEAD -> main, tag: v0.12.0)"
+ git_full = "ea9a97b6aef45c921ea36b599666e7e83b84070c"
+ git_date = "2025-11-26 10:18:55 +0100"
keywords = {"refnames": git_refnames, "full": git_full, "date": git_date}
return keywords
=====================================
pyogrio/core.py
=====================================
@@ -1,7 +1,6 @@
"""Core functions to interact with OGR data sources."""
from pathlib import Path
-from typing import Optional, Union
from pyogrio._env import GDALEnv
from pyogrio.util import (
@@ -237,9 +236,9 @@ def read_info(
----------
path_or_buffer : str, pathlib.Path, bytes, or file-like
A dataset path or URI, raw buffer, or file-like object with a read method.
- layer : [type], optional
+ layer : str or int, optional
Name or index of layer in data source. Reads the first layer by default.
- encoding : [type], optional (default: None)
+ encoding : str, optional (default: None)
If present, will be used as the encoding for reading string values from
the data source, unless encoding can be inferred directly from the data
source.
@@ -261,6 +260,8 @@ def read_info(
"crs": "<crs>",
"fields": <ndarray of field names>,
"dtypes": <ndarray of field dtypes>,
+ "ogr_types": <ndarray of OGR field types>,
+ "ogr_subtypes": <ndarray of OGR field subtypes>,
"encoding": "<encoding>",
"fid_column": "<fid column name or "">",
"geometry_name": "<geometry column name or "">",
@@ -336,7 +337,7 @@ def get_gdal_data_path():
return _get_gdal_data_path()
-def vsi_listtree(path: Union[str, Path], pattern: Optional[str] = None):
+def vsi_listtree(path: str | Path, pattern: str | None = None):
"""Recursively list the contents of a VSI directory.
An fnmatch pattern can be specified to filter the directories/files
@@ -356,7 +357,7 @@ def vsi_listtree(path: Union[str, Path], pattern: Optional[str] = None):
return ogr_vsi_listtree(path, pattern=pattern)
-def vsi_rmtree(path: Union[str, Path]):
+def vsi_rmtree(path: str | Path):
"""Recursively remove VSI directory.
Parameters
@@ -371,7 +372,7 @@ def vsi_rmtree(path: Union[str, Path]):
ogr_vsi_rmtree(path)
-def vsi_unlink(path: Union[str, Path]):
+def vsi_unlink(path: str | Path):
"""Remove a VSI file.
Parameters
=====================================
pyogrio/geopandas.py
=====================================
@@ -1,17 +1,21 @@
"""Functions for reading and writing GeoPandas dataframes."""
+import json
import os
import warnings
+from datetime import datetime
import numpy as np
from pyogrio._compat import (
HAS_GEOPANDAS,
+ HAS_PYARROW,
PANDAS_GE_15,
PANDAS_GE_20,
PANDAS_GE_22,
PANDAS_GE_30,
PYARROW_GE_19,
+ __gdal_version__,
)
from pyogrio.errors import DataSourceError
from pyogrio.raw import (
@@ -37,33 +41,87 @@ def _stringify_path(path):
return path
-def _try_parse_datetime(ser):
+def _try_parse_datetime(ser, datetime_as_string: bool, mixed_offsets_as_utc: bool):
import pandas as pd # only called when pandas is known to be installed
+ from pandas.api.types import is_string_dtype
+
+ datetime_kwargs = {}
+ if datetime_as_string:
+ if not is_string_dtype(ser.dtype):
+ # Support to return datetimes as strings using arrow only available for
+ # GDAL >= 3.11, so convert to string here if needed.
+ res = ser.astype("str")
+ if not PANDAS_GE_30:
+ # astype("str") also stringifies missing values in pandas < 3
+ res[ser.isna()] = None
+ res = res.str.replace(" ", "T")
+ return res
+ if __gdal_version__ < (3, 7, 0):
+ # GDAL < 3.7 doesn't return datetimes in ISO8601 format, so fix that
+ return ser.str.replace(" ", "T").str.replace("/", "-")
+ return ser
if PANDAS_GE_22:
- datetime_kwargs = {"format": "ISO8601"}
+ datetime_kwargs["format"] = "ISO8601"
elif PANDAS_GE_20:
- datetime_kwargs = {"format": "ISO8601", "errors": "ignore"}
+ datetime_kwargs["format"] = "ISO8601"
+ datetime_kwargs["errors"] = "ignore"
else:
- datetime_kwargs = {"yearfirst": True}
+ datetime_kwargs["yearfirst"] = True
+
with warnings.catch_warnings():
warnings.filterwarnings(
"ignore",
".*parsing datetimes with mixed time zones will raise.*",
FutureWarning,
)
- # pre-emptive try catch for when pandas will raise
- # (can tighten the exception type in future when it does)
+
+ warning = "Error parsing datetimes, original strings are returned: {message}"
try:
res = pd.to_datetime(ser, **datetime_kwargs)
- except Exception:
- res = ser
- # if object dtype, try parse as utc instead
- if res.dtype in ("object", "string"):
+
+ # With pandas >2 and <3, mixed time zones were returned as pandas
+ # Timestamps, so convert them to datetime objects.
+ if not mixed_offsets_as_utc and PANDAS_GE_20 and res.dtype == "object":
+ res = res.map(lambda x: x.to_pydatetime(), na_action="ignore")
+
+ except Exception as ex:
+ if isinstance(ex, ValueError) and "Mixed timezones detected" in str(ex):
+ # Parsing mixed time zones with to_datetime is not supported
+ # anymore in pandas >= 3.0, leading to a ValueError.
+ if mixed_offsets_as_utc:
+ # Convert mixed time zone datetimes to UTC.
+ try:
+ res = pd.to_datetime(ser, utc=True, **datetime_kwargs)
+ except Exception as ex:
+ warnings.warn(warning.format(message=str(ex)), stacklevel=3)
+ return ser
+ else:
+ # Using map seems to be the fastest way to convert the strings to
+ # datetime objects.
+ try:
+ res = ser.map(datetime.fromisoformat, na_action="ignore")
+ except Exception as ex:
+ warnings.warn(warning.format(message=str(ex)), stacklevel=3)
+ return ser
+
+ else:
+ # If the error is not related to mixed time zones, log it and return
+ # the original series.
+ warnings.warn(warning.format(message=str(ex)), stacklevel=3)
+ if __gdal_version__ < (3, 7, 0):
+ # GDAL < 3.7 doesn't return datetimes in ISO8601 format, so fix that
+ return ser.str.replace(" ", "T").str.replace("/", "-")
+
+ return ser
+
+ # For pandas < 3.0, to_datetime converted mixed time zone data to datetime objects.
+ # For mixed_offsets_as_utc they should be converted to UTC though...
+ if mixed_offsets_as_utc and res.dtype in ("object", "string"):
try:
res = pd.to_datetime(ser, utc=True, **datetime_kwargs)
- except Exception:
- pass
+ except Exception as ex:
+ warnings.warn(warning.format(message=str(ex)), stacklevel=3)
if res.dtype.kind == "M": # any datetime64
# GDAL only supports ms precision, convert outputs to match.
@@ -73,6 +131,7 @@ def _try_parse_datetime(ser):
res = res.dt.as_unit("ms")
else:
res = res.dt.round(freq="ms")
+
return res
@@ -96,6 +155,8 @@ def read_dataframe(
use_arrow=None,
on_invalid="raise",
arrow_to_pandas_kwargs=None,
+ datetime_as_string=False,
+ mixed_offsets_as_utc=True,
**kwargs,
):
"""Read from an OGR data source to a GeoPandas GeoDataFrame or Pandas DataFrame.
@@ -103,6 +164,9 @@ def read_dataframe(
If the data source does not have a geometry column or ``read_geometry`` is False,
a DataFrame will be returned.
+ If you read data with datetime columns containing time zone information, check out
+ the notes below.
+
Requires ``geopandas`` >= 0.8.
Parameters
@@ -223,14 +287,55 @@ def read_dataframe(
arrow_to_pandas_kwargs : dict, optional (default: None)
When `use_arrow` is True, these kwargs will be passed to the `to_pandas`_
call for the arrow to pandas conversion.
+ datetime_as_string : bool, optional (default: False)
+ If True, will return datetime columns as detected by GDAL as ISO8601
+ strings and ``mixed_offsets_as_utc`` will be ignored.
+    mixed_offsets_as_utc : bool, optional (default: True)
+ By default, datetime columns are read as the pandas datetime64 dtype.
+ This can represent the data as-is in the case that the column contains
+ only naive datetimes (without time zone information), only UTC datetimes,
+ or if all datetimes in the column have the same time zone offset. Note
+ that in time zones with daylight saving time, datetimes will have
+ different offsets throughout the year!
+
+ For columns that don't comply with the above, i.e. columns that contain
+ mixed offsets, the behavior depends on the value of this parameter:
+
+ - If ``True`` (default), such datetimes are converted to UTC. In the case
+ of a mixture of time zone aware and naive datetimes, the naive
+ datetimes are assumed to be in UTC already. Datetime columns returned
+ will always be pandas datetime64.
+ - If ``False``, such datetimes with mixed offsets are returned with
+ those offsets preserved. Because pandas datetime64 columns don't
+ support mixed time zone offsets, such columns are returned as object
+ columns with python datetime values with fixed offsets. If you want
+ to roundtrip datetimes without data loss, this is the recommended
+ option, but you lose the functionality of a datetime64 column.
+
+ If ``datetime_as_string`` is True, this option is ignored.
+
**kwargs
- Additional driver-specific dataset open options passed to OGR. Invalid
+ Additional driver-specific dataset open options passed to OGR. Invalid
options will trigger a warning.
Returns
-------
GeoDataFrame or DataFrame (if no geometry is present)
+ Notes
+ -----
+ When you have datetime columns with time zone information, it is important to
+ note that GDAL only represents time zones as UTC offsets, whilst pandas uses
+ IANA time zones (via `pytz` or `zoneinfo`). As a result, even if a column in a
+ DataFrame contains datetimes in a single time zone, this will often still result
+ in mixed time zone offsets being written for time zones where daylight saving
+ time is used (e.g. +01:00 and +02:00 offsets for time zone Europe/Brussels). When
+ roundtripping through GDAL, the information about the original time zone is
+    lost; only the offsets can be preserved. By default, `pyogrio.read_dataframe()`
+ will convert columns with mixed offsets to UTC to return a datetime64 column. If
+ you want to preserve the original offsets, you can use `datetime_as_string=True`
+ or `mixed_offsets_as_utc=False`.
+
.. _OGRSQL:
https://gdal.org/user/ogr_sql_dialect.html#ogr-sql-dialect
@@ -267,11 +372,13 @@ def read_dataframe(
read_func = read_arrow if use_arrow else read
gdal_force_2d = False if use_arrow else force_2d
- if not use_arrow:
- # For arrow, datetimes are read as is.
- # For numpy IO, datetimes are read as string values to preserve timezone info
- # as numpy does not directly support timezones.
- kwargs["datetime_as_string"] = True
+
+    # Always read datetimes as string values to preserve (mixed) time zone info
+    # correctly. Without arrow this is needed because numpy does not directly
+    # support time zones, and performance is also a lot better. With arrow it is
+    # needed because Arrow datetime columns don't support mixed time zone offsets
+    # and, e.g. for .fgb files, time zone info isn't handled correctly even for a
+    # single time zone offset if datetimes are not read as strings.
result = read_func(
path_or_buffer,
layer=layer,
@@ -288,6 +395,7 @@ def read_dataframe(
sql=sql,
sql_dialect=sql_dialect,
return_fids=fid_as_index,
+ datetime_as_string=True,
**kwargs,
)
@@ -330,6 +438,26 @@ def read_dataframe(
del table
+ # convert datetime columns that were read as string to datetime
+ for dtype, column in zip(meta["dtypes"], meta["fields"]):
+ if dtype is not None and dtype.startswith("datetime"):
+ df[column] = _try_parse_datetime(
+ df[column], datetime_as_string, mixed_offsets_as_utc
+ )
+ for ogr_subtype, c in zip(meta["ogr_subtypes"], meta["fields"]):
+ if ogr_subtype == "OFSTJSON":
+ # When reading .parquet files with arrow, JSON fields are already
+ # parsed, so only parse if strings.
+ dtype = pd.api.types.infer_dtype(df[c])
+ if dtype == "string":
+ try:
+ df[c] = df[c].map(json.loads, na_action="ignore")
+ except Exception:
+ warnings.warn(
+ f"Could not parse column '{c}' as JSON; leaving as string",
+ stacklevel=2,
+ )
+
if fid_as_index:
df = df.set_index(meta["fid_column"])
df.index.names = ["fid"]
@@ -341,8 +469,18 @@ def read_dataframe(
elif geometry_name in df.columns:
wkb_values = df.pop(geometry_name)
if PANDAS_GE_15 and wkb_values.dtype != object:
- # for example ArrowDtype will otherwise create numpy array with pd.NA
- wkb_values = wkb_values.to_numpy(na_value=None)
+ if (
+ HAS_PYARROW
+ and isinstance(wkb_values.dtype, pd.ArrowDtype)
+ and isinstance(wkb_values.dtype.pyarrow_dtype, pa.BaseExtensionType)
+ ):
+ # handle BaseExtensionType(extension<geoarrow.wkb>)
+ wkb_values = pa.array(wkb_values.array).to_numpy(
+ zero_copy_only=False
+ )
+ else:
+ # for example ArrowDtype will otherwise give numpy array with pd.NA
+ wkb_values = wkb_values.to_numpy(na_value=None)
df["geometry"] = shapely.from_wkb(wkb_values, on_invalid=on_invalid)
if force_2d:
df["geometry"] = shapely.force_2d(df["geometry"])
@@ -361,7 +499,18 @@ def read_dataframe(
df = pd.DataFrame(data, columns=columns, index=index)
for dtype, c in zip(meta["dtypes"], df.columns):
if dtype.startswith("datetime"):
- df[c] = _try_parse_datetime(df[c])
+ df[c] = _try_parse_datetime(df[c], datetime_as_string, mixed_offsets_as_utc)
+ for ogr_subtype, c in zip(meta["ogr_subtypes"], meta["fields"]):
+ if ogr_subtype == "OFSTJSON":
+ dtype = pd.api.types.infer_dtype(df[c])
+ if dtype == "string":
+ try:
+ df[c] = df[c].map(json.loads, na_action="ignore")
+ except Exception:
+ warnings.warn(
+ f"Could not parse column '{c}' as JSON; leaving as string",
+ stacklevel=2,
+ )
if geometry is None or not read_geometry:
return df
@@ -480,6 +629,18 @@ def write_dataframe(
do this (for example if an option exists as both dataset and layer
option).
+ Notes
+ -----
+ When you have datetime columns with time zone information, it is important to
+ note that GDAL only represents time zones as UTC offsets, whilst pandas uses
+ IANA time zones (via `pytz` or `zoneinfo`). As a result, even if a column in a
+ DataFrame contains datetimes in a single time zone, this will often still result
+ in mixed time zone offsets being written for time zones where daylight saving
+ time is used (e.g. +01:00 and +02:00 offsets for time zone Europe/Brussels).
+
+ Object dtype columns containing `datetime` or `pandas.Timestamp` objects will
+ also be written as datetime fields, preserving time zone information where possible.
+
"""
# TODO: add examples to the docstring (e.g. OGR kwargs)
@@ -584,6 +745,7 @@ def write_dataframe(
crs = geometry.crs.to_wkt("WKT1_GDAL")
if use_arrow:
+ import pandas as pd # only called when pandas is known to be installed
import pyarrow as pa
from pyogrio.raw import write_arrow
@@ -619,8 +781,35 @@ def write_dataframe(
df = pd.DataFrame(df, copy=False)
df[geometry_column] = geometry
+ # Arrow doesn't support datetime columns with mixed time zones, and GDAL only
+ # supports time zone offsets. Hence, to avoid data loss, convert columns that
+ # can contain datetime values with different offsets to strings.
+ # Also pass a list of these columns on to GDAL so it can still treat them as
+ # datetime columns when writing the dataset.
+ datetime_cols = []
+ for name, dtype in df.dtypes.items():
+ if dtype == "object":
+ # An object column with datetimes can contain multiple offsets.
+ if pd.api.types.infer_dtype(df[name]) == "datetime":
+ df[name] = df[name].astype("string")
+ datetime_cols.append(name)
+
+ elif isinstance(dtype, pd.DatetimeTZDtype) and str(dtype.tz) != "UTC":
+ # A pd.datetime64 column with a time zone different than UTC can contain
+ # data with different offsets because of summer/winter time.
+ df[name] = df[name].astype("string")
+ datetime_cols.append(name)
+
table = pa.Table.from_pandas(df, preserve_index=False)
+ # Add metadata to datetime columns so GDAL knows they are datetimes.
+ table = _add_column_metadata(
+ table,
+ column_metadata={
+ col: {"GDAL:OGR:type": "DateTime"} for col in datetime_cols
+ },
+ )
+
# Null arrow columns are not supported by GDAL, so convert to string
for field_index, field in enumerate(table.schema):
if field.type == pa.null():
@@ -678,26 +867,39 @@ def write_dataframe(
gdal_tz_offsets = {}
for name in fields:
col = df[name]
+ values = None
+
if isinstance(col.dtype, pd.DatetimeTZDtype):
- # Deal with datetimes with timezones by passing down timezone separately
+ # Deal with datetimes with time zones by passing down time zone separately
# pass down naive datetime
naive = col.dt.tz_localize(None)
values = naive.values
# compute offset relative to UTC explicitly
tz_offset = naive - col.dt.tz_convert("UTC").dt.tz_localize(None)
- # Convert to GDAL timezone offset representation.
+ # Convert to GDAL time zone offset representation.
# GMT is represented as 100 and offsets are represented by adding /
# subtracting 1 for every 15 minutes different from GMT.
# https://gdal.org/development/rfc/rfc56_millisecond_precision.html#core-changes
# Convert each row offset to a signed multiple of 15m and add to GMT value
gdal_offset_representation = tz_offset // pd.Timedelta("15m") + 100
gdal_tz_offsets[name] = gdal_offset_representation.values
- else:
+
+ elif col.dtype == "object":
+ # Column of Timestamp/datetime objects, split in naive datetime and tz.
+ if pd.api.types.infer_dtype(df[name]) == "datetime":
+ tz_offset = col.map(lambda x: x.utcoffset(), na_action="ignore")
+ gdal_offset_repr = tz_offset // pd.Timedelta("15m") + 100
+ gdal_tz_offsets[name] = gdal_offset_repr.values
+ naive = col.map(lambda x: x.replace(tzinfo=None), na_action="ignore")
+ values = naive.values
+
+ if values is None:
values = col.values
+
if isinstance(values, pd.api.extensions.ExtensionArray):
from pandas.arrays import BooleanArray, FloatingArray, IntegerArray
- if isinstance(values, (IntegerArray, FloatingArray, BooleanArray)):
+ if isinstance(values, IntegerArray | FloatingArray | BooleanArray):
field_data.append(values._data)
field_mask.append(values._mask)
else:
@@ -729,3 +931,48 @@ def write_dataframe(
gdal_tz_offsets=gdal_tz_offsets,
**kwargs,
)
+
+
+def _add_column_metadata(table, column_metadata: dict = {}):
+ """Add or update column-level metadata to an arrow table.
+
+ Parameters
+ ----------
+ table : pyarrow.Table
+ The table to add the column metadata to.
+ column_metadata : dict
+ A dictionary with column metadata in the form
+ {
+ "column_1": {"some": "data"},
+ "column_2": {"more": "stuff"},
+ }
+
+ Returns
+ -------
+ pyarrow.Table: table with the updated column metadata.
+ """
+ import pyarrow as pa
+
+ if not column_metadata:
+ return table
+
+ # Create updated column fields with new metadata
+ fields = []
+ for col in table.schema.names:
+ if col in column_metadata:
+ # Add/update column metadata
+ metadata = table.field(col).metadata or {}
+ for key, value in column_metadata[col].items():
+ metadata[key] = value
+ # Update field with updated metadata
+ fields.append(table.field(col).with_metadata(metadata))
+ else:
+ fields.append(table.field(col))
+
+ # Create new schema with the updated field metadata
+ schema = pa.schema(fields, metadata=table.schema.metadata)
+
+ # Build new table with updated schema (shouldn't copy data)
+ table = table.cast(schema)
+
+ return table
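
A worked example of the GDAL time zone encoding referenced in the `write_dataframe` comments above (GMT is represented as 100, with one unit per 15 minutes of offset); the offsets shown are illustrative:

```python
# Worked example of the offset encoding described above: GMT == 100 and each
# 15 minutes of UTC offset adds or subtracts 1.
import pandas as pd

offset = pd.Timedelta(hours=2)              # e.g. +02:00 (Europe/Brussels in summer)
print(offset // pd.Timedelta("15m") + 100)  # 108

offset = pd.Timedelta(hours=-5)             # e.g. -05:00
print(offset // pd.Timedelta("15m") + 100)  # 80
```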
=====================================
pyogrio/raw.py
=====================================
@@ -4,7 +4,7 @@ import warnings
from io import BytesIO
from pathlib import Path
-from pyogrio._compat import HAS_ARROW_API, HAS_ARROW_WRITE_API, HAS_PYARROW
+from pyogrio._compat import HAS_ARROW_WRITE_API, HAS_PYARROW
from pyogrio._env import GDALEnv
from pyogrio.core import detect_write_driver
from pyogrio.errors import DataSourceError
@@ -151,7 +151,7 @@ def read(
If True, will return the FIDs of the feature that were read.
datetime_as_string : bool, optional (default: False)
If True, will return datetime dtypes as detected by GDAL as a string
- array (which can be used to extract timezone info), instead of
+ array (which can be used to extract time zone info), instead of
a datetime64 array.
**kwargs
@@ -171,9 +171,11 @@ def read(
Meta is: {
"crs": "<crs>",
"fields": <ndarray of field names>,
- "dtypes": <ndarray of numpy dtypes corresponding to fields>
+ "dtypes": <ndarray of numpy dtypes corresponding to fields>,
+ "ogr_types": <ndarray of OGR types corresponding to fields>,
+ "ogr_subtypes": <ndarray of OGR subtypes corresponding to fields>,
"encoding": "<encoding>",
- "geometry_type": "<geometry type>"
+ "geometry_type": "<geometry type>",
}
.. _OGRSQL:
@@ -233,6 +235,7 @@ def read_arrow(
sql=None,
sql_dialect=None,
return_fids=False,
+ datetime_as_string=False,
**kwargs,
):
"""Read OGR data source into a pyarrow Table.
@@ -249,9 +252,13 @@ def read_arrow(
Meta is: {
"crs": "<crs>",
"fields": <ndarray of field names>,
+ "dtypes": <ndarray of numpy dtypes corresponding to fields>,
+ "ogr_types": <ndarray of OGR types corresponding to fields>,
+ "ogr_subtypes": <ndarray of OGR subtypes corresponding to fields>,
"encoding": "<encoding>",
"geometry_type": "<geometry_type>",
"geometry_name": "<name of geometry column in arrow table>",
+ "fid_column": "<name of FID column in arrow table>"
}
"""
@@ -303,6 +310,7 @@ def read_arrow(
skip_features=gdal_skip_features,
batch_size=batch_size,
use_pyarrow=True,
+ datetime_as_string=datetime_as_string,
**kwargs,
) as source:
meta, reader = source
@@ -358,6 +366,7 @@ def open_arrow(
return_fids=False,
batch_size=65_536,
use_pyarrow=False,
+ datetime_as_string=False,
**kwargs,
):
"""Open OGR data source as a stream of Arrow record batches.
@@ -386,6 +395,9 @@ def open_arrow(
ArrowStream object. In the default case, this stream object needs
to be passed to another library supporting the Arrow PyCapsule
Protocol to consume the stream of data.
+ datetime_as_string : bool, optional (default: False)
+ If True, will return datetime dtypes as detected by GDAL as strings,
+ as Arrow doesn't support e.g. mixed time zones.
Examples
--------
@@ -423,15 +435,16 @@ def open_arrow(
Meta is: {
"crs": "<crs>",
"fields": <ndarray of field names>,
+ "dtypes": <ndarray of numpy dtypes corresponding to fields>,
+ "ogr_types": <ndarray of OGR types corresponding to fields>,
+ "ogr_subtypes": <ndarray of OGR subtypes corresponding to fields>,
"encoding": "<encoding>",
"geometry_type": "<geometry_type>",
"geometry_name": "<name of geometry column in arrow table>",
+ "fid_column": "<name of FID column in arrow table>"
}
"""
- if not HAS_ARROW_API:
- raise RuntimeError("GDAL>= 3.6 required to read using arrow")
-
dataset_kwargs = _preprocess_options_key_value(kwargs) if kwargs else {}
return ogr_open_arrow(
@@ -453,6 +466,7 @@ def open_arrow(
dataset_kwargs=dataset_kwargs,
batch_size=batch_size,
use_pyarrow=use_pyarrow,
+ datetime_as_string=datetime_as_string,
)
@@ -575,12 +589,6 @@ def _get_write_path_driver(path, driver, append=False):
f"{get_gdal_version_string()}"
)
- # prevent segfault from: https://github.com/OSGeo/gdal/issues/5739
- if append and driver == "FlatGeobuf" and get_gdal_version() <= (3, 5, 0):
- raise RuntimeError(
- "append to FlatGeobuf is not supported for GDAL <= 3.5.0 due to segfault"
- )
-
return path, driver
@@ -685,15 +693,17 @@ def write(
Layer creation options (format specific) passed to OGR. Specify as
a key-value dictionary.
gdal_tz_offsets : dict, optional (default: None)
- Used to handle GDAL timezone offsets for each field contained in dict.
+ Used to handle GDAL time zone offsets for each field contained in dict.
**kwargs
Additional driver-specific dataset creation options passed to OGR. Invalid
options will trigger a warning.
"""
- # if dtypes is given, remove it from kwargs (dtypes is included in meta returned by
+ # remove some unneeded kwargs (e.g. dtypes is included in meta returned by
# read, and it is convenient to pass meta directly into write for round trip tests)
kwargs.pop("dtypes", None)
+ kwargs.pop("ogr_types", None)
+ kwargs.pop("ogr_subtypes", None)
path, driver = _get_write_path_driver(path, driver, append=append)
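
A hedged sketch of the new `datetime_as_string` flag on the Arrow entry points (placeholder path; `use_pyarrow=True` requires pyarrow):

```python
# Hedged sketch of the datetime_as_string option added to the Arrow readers;
# "example.gpkg" is a hypothetical dataset path.
from pyogrio.raw import open_arrow, read_arrow

# One-shot read into a pyarrow Table; datetimes come back as ISO8601 strings.
meta, table = read_arrow("example.gpkg", datetime_as_string=True)
print(meta["ogr_types"], meta["ogr_subtypes"])

# Streaming variant: iterate record batches via a pyarrow RecordBatchReader.
with open_arrow("example.gpkg", datetime_as_string=True, use_pyarrow=True) as (
    meta,
    reader,
):
    for batch in reader:
        print(batch.num_rows)
```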
=====================================
pyogrio/tests/conftest.py
=====================================
@@ -1,16 +1,14 @@
+"""Module with helper functions, fixtures, and common test data for pyogrio tests."""
+
from io import BytesIO
from pathlib import Path
from zipfile import ZIP_DEFLATED, ZipFile
import numpy as np
-from pyogrio import (
- __gdal_version_string__,
- __version__,
- list_drivers,
-)
+from pyogrio import __gdal_version_string__, __version__, list_drivers
from pyogrio._compat import (
- HAS_ARROW_API,
+ GDAL_GE_37,
HAS_ARROW_WRITE_API,
HAS_GDAL_GEOS,
HAS_PYARROW,
@@ -51,6 +49,8 @@ START_FID = {
".shp": 0,
}
+GDAL_HAS_PARQUET_DRIVER = "Parquet" in list_drivers()
+
def pytest_report_header(config):
drivers = ", ".join(
@@ -65,10 +65,7 @@ def pytest_report_header(config):
# marks to skip tests if optional dependencies are not present
-requires_arrow_api = pytest.mark.skipif(not HAS_ARROW_API, reason="GDAL>=3.6 required")
-requires_pyarrow_api = pytest.mark.skipif(
- not HAS_ARROW_API or not HAS_PYARROW, reason="GDAL>=3.6 and pyarrow required"
-)
+requires_pyarrow_api = pytest.mark.skipif(not HAS_PYARROW, reason="pyarrow required")
requires_pyproj = pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj required")
@@ -85,6 +82,9 @@ requires_shapely = pytest.mark.skipif(not HAS_SHAPELY, reason="Shapely >= 2.0 re
def prepare_testfile(testfile_path, dst_dir, ext):
+ if ext == ".gpkg.zip" and not GDAL_GE_37:
+ pytest.skip(".gpkg.zip support requires GDAL >= 3.7")
+
if ext == testfile_path.suffix:
return testfile_path
@@ -100,7 +100,7 @@ def prepare_testfile(testfile_path, dst_dir, ext):
# allow mixed Polygons/MultiPolygons type
meta["geometry_type"] = "Unknown"
- elif ext == ".gpkg":
+ elif ext in (".gpkg", ".gpkg.zip"):
# For .gpkg, spatial_index=False to avoid the rows being reordered
meta["spatial_index"] = False
meta["geometry_type"] = "MultiPolygon"
@@ -201,36 +201,70 @@ def no_geometry_file(tmp_path):
return filename
- at pytest.fixture(scope="function")
-def list_field_values_file(tmp_path):
+def list_field_values_geojson_file(tmp_path):
# Create a GeoJSON file with list values in a property
list_geojson = """{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
- "properties": { "int64": 1, "list_int64": [0, 1] },
+ "properties": {
+ "int": 1,
+ "list_int": [0, 1],
+ "list_double": [0.0, 1.0],
+ "list_string": ["string1", "string2"],
+ "list_int_with_null": [0, null],
+ "list_string_with_null": ["string1", null]
+ },
"geometry": { "type": "Point", "coordinates": [0, 2] }
},
{
"type": "Feature",
- "properties": { "int64": 2, "list_int64": [2, 3] },
+ "properties": {
+ "int": 2,
+ "list_int": [2, 3],
+ "list_double": [2.0, 3.0],
+ "list_string": ["string3", "string4", ""],
+ "list_int_with_null": [2, 3],
+ "list_string_with_null": ["string3", "string4", ""]
+ },
"geometry": { "type": "Point", "coordinates": [1, 2] }
},
{
"type": "Feature",
- "properties": { "int64": 3, "list_int64": [4, 5] },
+ "properties": {
+ "int": 3,
+ "list_int": [],
+ "list_double": [],
+ "list_string": [],
+ "list_int_with_null": [],
+ "list_string_with_null": []
+ },
"geometry": { "type": "Point", "coordinates": [2, 2] }
},
{
"type": "Feature",
- "properties": { "int64": 4, "list_int64": [6, 7] },
- "geometry": { "type": "Point", "coordinates": [3, 2] }
+ "properties": {
+ "int": 4,
+ "list_int": null,
+ "list_double": null,
+ "list_string": null,
+ "list_int_with_null": null,
+ "list_string_with_null": null
+ },
+ "geometry": { "type": "Point", "coordinates": [2, 2] }
},
{
"type": "Feature",
- "properties": { "int64": 5, "list_int64": [8, 9] },
- "geometry": { "type": "Point", "coordinates": [4, 2] }
+ "properties": {
+ "int": 5,
+ "list_int": null,
+ "list_double": null,
+ "list_string": [""],
+ "list_int_with_null": null,
+ "list_string_with_null": [""]
+ },
+ "geometry": { "type": "Point", "coordinates": [2, 2] }
}
]
}"""
@@ -242,6 +276,66 @@ def list_field_values_file(tmp_path):
return filename
+def list_field_values_parquet_file():
+ """Return the path to a Parquet file with list values in a property.
+
+ Because in the CI environments pyarrow.parquet is typically not available, we save
+ the file in the test data directory instead of always creating it from scratch.
+
+ The code to create it is here though, in case it needs to be recreated later.
+ """
+ # Check if the file already exists in the test data dir
+ fixture_path = _data_dir / "list_field_values_file.parquet"
+ if fixture_path.exists():
+ return fixture_path
+
+ # The file doesn't exist, so create it
+ try:
+ import pyarrow as pa
+ from pyarrow import parquet as pq
+
+ import shapely
+ except ImportError as ex:
+ raise RuntimeError(
+ f"test file {fixture_path} does not exist, but error importing: {ex}."
+ )
+
+ table = pa.table(
+ {
+ "geometry": shapely.to_wkb(shapely.points(np.ones((5, 2)))),
+ "int": [1, 2, 3, 4, 5],
+ "list_int": [[0, 1], [2, 3], [], None, None],
+ "list_double": [[0.0, 1.0], [2.0, 3.0], [], None, None],
+ "list_string": [
+ ["string1", "string2"],
+ ["string3", "string4", ""],
+ [],
+ None,
+ [""],
+ ],
+ "list_int_with_null": [[0, None], [2, 3], [], None, None],
+ "list_string_with_null": [
+ ["string1", None],
+ ["string3", "string4", ""],
+ [],
+ None,
+ [""],
+ ],
+ }
+ )
+ pq.write_table(table, fixture_path)
+
+ return fixture_path
+
+
+ at pytest.fixture(scope="function", params=[".geojson", ".parquet"])
+def list_field_values_files(tmp_path, request):
+ if request.param == ".geojson":
+ return list_field_values_geojson_file(tmp_path)
+ elif request.param == ".parquet":
+ return list_field_values_parquet_file()
+
+
@pytest.fixture(scope="function")
def nested_geojson_file(tmp_path):
# create GeoJSON file with nested properties
@@ -271,6 +365,45 @@ def nested_geojson_file(tmp_path):
return filename
+ at pytest.fixture(scope="function")
+def list_nested_struct_parquet_file(tmp_path):
+ """Create a Parquet file in tmp_path with nested values in a property.
+
+ Because in the CI environments pyarrow.parquet is typically not available, we save
+ the file in the test data directory instead of always creating it from scratch.
+
+ The code to create it is here though, in case it needs to be recreated later.
+ """
+ # Check if the file already exists in the test data dir
+ fixture_path = _data_dir / "list_nested_struct_file.parquet"
+ if fixture_path.exists():
+ return fixture_path
+
+ # The file doesn't exist, so create it
+ try:
+ import pyarrow as pa
+ from pyarrow import parquet as pq
+
+ import shapely
+ except ImportError as ex:
+ raise RuntimeError(
+ f"test file {fixture_path} does not exist, but error importing: {ex}."
+ )
+
+ table = pa.table(
+ {
+ "geometry": shapely.to_wkb(shapely.points(np.ones((3, 2)))),
+ "col_flat": [0, 1, 2],
+ "col_struct": [{"a": 1, "b": 2}] * 3,
+ "col_nested": [[{"a": 1, "b": 2}] * 2] * 3,
+ "col_list": [[1, 2, 3]] * 3,
+ }
+ )
+ pq.write_table(table, fixture_path)
+
+ return fixture_path
+
+
@pytest.fixture(scope="function")
def datetime_file(tmp_path):
# create GeoJSON file with millisecond precision
@@ -299,7 +432,7 @@ def datetime_file(tmp_path):
@pytest.fixture(scope="function")
def datetime_tz_file(tmp_path):
- # create GeoJSON file with datetimes with timezone
+ # create GeoJSON file with datetimes with time zone
datetime_tz_geojson = """{
"type": "FeatureCollection",
"features": [
@@ -340,6 +473,27 @@ def geojson_bytes(tmp_path):
return bytes_buffer
+@pytest.fixture(scope="function")
+def geojson_datetime_long_ago(tmp_path):
+ # create GeoJSON file with datetimes from long ago
+ datetime_tz_geojson = """{
+ "type": "FeatureCollection",
+ "features": [
+ {
+ "type": "Feature",
+ "properties": { "datetime_col": "1670-01-01T09:00:00" },
+ "geometry": { "type": "Point", "coordinates": [1, 1] }
+ }
+ ]
+ }"""
+
+ filename = tmp_path / "test_datetime_long_ago.geojson"
+ with open(filename, "w") as f:
+ f.write(datetime_tz_geojson)
+
+ return filename
+
+
@pytest.fixture(scope="function")
def geojson_filelike(tmp_path):
"""Extracts first 3 records from naturalearth_lowres and writes to GeoJSON,
@@ -355,6 +509,34 @@ def geojson_filelike(tmp_path):
yield f
+@pytest.fixture(scope="function")
+def kml_file(tmp_path):
+ # create KML file
+ kml_data = """<?xml version="1.0" encoding="utf-8" ?>
+ <kml xmlns="http://www.opengis.net/kml/2.2">
+ <Document id="root_doc">
+ <Schema name="interfaces1" id="interfaces1">
+ <SimpleField name="id" type="float"></SimpleField>
+ <SimpleField name="formation" type="string"></SimpleField>
+ </Schema>
+ <Folder><name>interfaces1</name>
+ <Placemark>
+ <ExtendedData><SchemaData schemaUrl="#interfaces1">
+ <SimpleData name="formation">Ton</SimpleData>
+ </SchemaData></ExtendedData>
+ <Point><coordinates>19.1501280458077,293.313485355882</coordinates></Point>
+ </Placemark>
+ </Folder>
+ </Document>
+ </kml>
+ """
+ filename = tmp_path / "test.kml"
+ with open(filename, "w") as f:
+ _ = f.write(kml_data)
+
+ return filename
+
+
@pytest.fixture(scope="function")
def nonseekable_bytes(tmp_path):
# mock a non-seekable byte stream, such as a zstandard handle
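
For reference, a rough sketch of how the committed Parquet fixtures used above can be regenerated locally. It assumes the test dependencies (pytest, geopandas, shapely, pyarrow) are installed; the helper only writes the file when it is missing from the test data directory, so the committed copy has to be removed first.

    from pyogrio.tests.conftest import _data_dir, list_field_values_parquet_file

    # Remove the committed fixture so the helper rebuilds it from scratch.
    (_data_dir / "list_field_values_file.parquet").unlink(missing_ok=True)

    # Recreate the fixture and print the path it was written to.
    print(list_field_values_parquet_file())
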
=====================================
pyogrio/tests/fixtures/list_field_values_file.parquet
=====================================
Binary files /dev/null and b/pyogrio/tests/fixtures/list_field_values_file.parquet differ
=====================================
pyogrio/tests/fixtures/list_nested_struct_file.parquet
=====================================
Binary files /dev/null and b/pyogrio/tests/fixtures/list_nested_struct_file.parquet differ
=====================================
pyogrio/tests/test_arrow.py
=====================================
@@ -1,7 +1,6 @@
import contextlib
import json
import math
-import os
import sys
from io import BytesIO
from packaging.version import Version
@@ -133,13 +132,6 @@ def test_read_arrow_ignore_geometry(naturalearth_lowres):
assert_frame_equal(result, expected)
-def test_read_arrow_nested_types(list_field_values_file):
- # with arrow, list types are supported
- result = read_dataframe(list_field_values_file, use_arrow=True)
- assert "list_int64" in result.columns
- assert result["list_int64"][0].tolist() == [0, 1]
-
-
def test_read_arrow_to_pandas_kwargs(no_geometry_file):
# with arrow, list types are supported
arrow_to_pandas_kwargs = {"strings_to_categorical": True}
@@ -300,29 +292,6 @@ def test_open_arrow_capsule_protocol_without_pyarrow(naturalearth_lowres):
assert result.equals(expected)
-@contextlib.contextmanager
-def use_arrow_context():
- original = os.environ.get("PYOGRIO_USE_ARROW", None)
- os.environ["PYOGRIO_USE_ARROW"] = "1"
- yield
- if original:
- os.environ["PYOGRIO_USE_ARROW"] = original
- else:
- del os.environ["PYOGRIO_USE_ARROW"]
-
-
-def test_enable_with_environment_variable(list_field_values_file):
- # list types are only supported with arrow, so don't work by default and work
- # when arrow is enabled through env variable
- result = read_dataframe(list_field_values_file)
- assert "list_int64" not in result.columns
-
- with use_arrow_context():
- result = read_dataframe(list_field_values_file)
-
- assert "list_int64" in result.columns
-
-
@pytest.mark.skipif(
__gdal_version__ < (3, 8, 3), reason="Arrow bool value bug fixed in GDAL >= 3.8.3"
)
@@ -507,10 +476,6 @@ def test_write_geojson(tmp_path, naturalearth_lowres):
@requires_arrow_write_api
-@pytest.mark.skipif(
- __gdal_version__ < (3, 6, 0),
- reason="OpenFileGDB write support only available for GDAL >= 3.6.0",
-)
@pytest.mark.parametrize(
"write_int64",
[
@@ -674,7 +639,7 @@ def test_write_append(request, tmp_path, naturalearth_lowres, ext):
assert read_info(filename)["features"] == 354
- at pytest.mark.parametrize("driver,ext", [("GML", ".gml"), ("GeoJSONSeq", ".geojsons")])
+ at pytest.mark.parametrize("driver,ext", [("GML", ".gml")])
@requires_arrow_write_api
def test_write_append_unsupported(tmp_path, naturalearth_lowres, driver, ext):
meta, table = read_arrow(naturalearth_lowres)
@@ -992,9 +957,6 @@ def test_write_memory_driver_required(naturalearth_lowres):
@requires_arrow_write_api
@pytest.mark.parametrize("driver", ["ESRI Shapefile", "OpenFileGDB"])
def test_write_memory_unsupported_driver(naturalearth_lowres, driver):
- if driver == "OpenFileGDB" and __gdal_version__ < (3, 6, 0):
- pytest.skip("OpenFileGDB write support only available for GDAL >= 3.6.0")
-
meta, table = read_arrow(naturalearth_lowres, max_features=1)
buffer = BytesIO()
=====================================
pyogrio/tests/test_core.py
=====================================
@@ -22,7 +22,12 @@ from pyogrio._compat import GDAL_GE_38
from pyogrio._env import GDALEnv
from pyogrio.errors import DataLayerError, DataSourceError
from pyogrio.raw import read, write
-from pyogrio.tests.conftest import START_FID, prepare_testfile, requires_shapely
+from pyogrio.tests.conftest import (
+ DRIVERS,
+ START_FID,
+ prepare_testfile,
+ requires_shapely,
+)
import pytest
@@ -135,11 +140,7 @@ def test_list_drivers():
# verify that the core drivers are present
for name in ("ESRI Shapefile", "GeoJSON", "GeoJSONSeq", "GPKG", "OpenFileGDB"):
assert name in all_drivers
-
expected_capability = "rw"
- if name == "OpenFileGDB" and __gdal_version__ < (3, 6, 0):
- expected_capability = "r"
-
assert all_drivers[name] == expected_capability
drivers = list_drivers(read=True)
@@ -391,10 +392,6 @@ def test_read_bounds_mask(naturalearth_lowres_all_ext, mask, expected):
assert array_equal(fids, fids_expected)
-@pytest.mark.skipif(
- __gdal_version__ < (3, 4, 0),
- reason="Cannot determine if GEOS is present or absent for GDAL < 3.4",
-)
def test_read_bounds_bbox_intersects_vs_envelope_overlaps(naturalearth_lowres_all_ext):
# If GEOS is present and used by GDAL, bbox filter will be based on intersection
# of bbox and actual geometries; if GEOS is absent or not used by GDAL, it
@@ -415,7 +412,9 @@ def test_read_bounds_bbox_intersects_vs_envelope_overlaps(naturalearth_lowres_al
assert array_equal(fids, fids_expected)
- at pytest.mark.parametrize("naturalearth_lowres", [".shp", ".gpkg"], indirect=True)
+ at pytest.mark.parametrize(
+ "naturalearth_lowres", [".shp", ".shp.zip", ".gpkg", ".gpkg.zip"], indirect=True
+)
def test_read_info(naturalearth_lowres):
meta = read_info(naturalearth_lowres)
@@ -427,11 +426,12 @@ def test_read_info(naturalearth_lowres):
assert meta["features"] == 177
assert allclose(meta["total_bounds"], (-180, -90, 180, 83.64513))
assert meta["capabilities"]["random_read"] is True
+ # The GPKG test files are created without spatial index
assert meta["capabilities"]["fast_spatial_filter"] is False
assert meta["capabilities"]["fast_feature_count"] is True
assert meta["capabilities"]["fast_total_bounds"] is True
- if naturalearth_lowres.suffix == ".gpkg":
+ if naturalearth_lowres.name.endswith((".gpkg", ".gpkg.zip")):
assert meta["fid_column"] == "fid"
assert meta["geometry_name"] == "geom"
assert meta["geometry_type"] == "MultiPolygon"
@@ -439,7 +439,7 @@ def test_read_info(naturalearth_lowres):
if GDAL_GE_38:
# this capability is only True for GPKG if GDAL >= 3.8
assert meta["capabilities"]["fast_set_next_by_index"] is True
- elif naturalearth_lowres.suffix == ".shp":
+ elif naturalearth_lowres.name.endswith((".shp", ".shp.zip")):
# fid_column == "" for formats where fid is not physically stored
assert meta["fid_column"] == ""
# geometry_name == "" for formats where geometry column name cannot be
@@ -452,6 +452,14 @@ def test_read_info(naturalearth_lowres):
raise ValueError(f"test not implemented for ext {naturalearth_lowres.suffix}")
+@pytest.mark.parametrize(
+ "naturalearth_lowres", [*DRIVERS.keys(), ".sqlite"], indirect=True
+)
+def test_read_info_encoding(naturalearth_lowres):
+ meta = read_info(naturalearth_lowres)
+ assert meta["encoding"].upper() == "UTF-8"
+
+
@pytest.mark.parametrize(
"testfile", ["naturalearth_lowres_vsimem", "naturalearth_lowres_vsi"]
)
@@ -567,12 +575,24 @@ def test_read_info_force_total_bounds(
assert info["total_bounds"] is None
+def test_read_info_jsonfield(nested_geojson_file):
+ """Test if JSON fields types are returned correctly."""
+ meta = read_info(nested_geojson_file)
+ assert meta["ogr_types"] == ["OFTString", "OFTString"]
+ assert meta["ogr_subtypes"] == ["OFSTNone", "OFSTJSON"]
+
+
def test_read_info_unspecified_layer_warning(data_dir):
"""Reading a multi-layer file without specifying a layer gives a warning."""
with pytest.warns(UserWarning, match="More than one layer found "):
read_info(data_dir / "sample.osm.pbf")
+def test_read_info_invalid_layer(naturalearth_lowres):
+ with pytest.raises(ValueError, match="'layer' parameter must be a str or int"):
+ read_bounds(naturalearth_lowres, layer=["list_arg_is_invalid"])
+
+
def test_read_info_without_geometry(no_geometry_file):
assert read_info(no_geometry_file)["total_bounds"] is None
=====================================
pyogrio/tests/test_geopandas_io.py
=====================================
@@ -1,5 +1,7 @@
import contextlib
import locale
+import os
+import re
import warnings
from datetime import datetime
from io import BytesIO
@@ -19,10 +21,10 @@ from pyogrio import (
from pyogrio._compat import (
GDAL_GE_37,
GDAL_GE_311,
- GDAL_GE_352,
HAS_ARROW_WRITE_API,
HAS_PYPROJ,
PANDAS_GE_15,
+ PANDAS_GE_23,
PANDAS_GE_30,
SHAPELY_GE_21,
)
@@ -35,6 +37,7 @@ from pyogrio.raw import (
from pyogrio.tests.conftest import (
ALL_EXTS,
DRIVERS,
+ GDAL_HAS_PARQUET_DRIVER,
START_FID,
requires_arrow_write_api,
requires_gdal_geos,
@@ -48,6 +51,7 @@ try:
import geopandas as gp
import pandas as pd
from geopandas.array import from_wkt
+ from pandas.api.types import is_datetime64_dtype, is_object_dtype, is_string_dtype
import shapely # if geopandas is present, shapely is expected to be present
from shapely.geometry import Point
@@ -93,14 +97,22 @@ def skip_if_no_arrow_write_api(request):
pytest.skip("GDAL>=3.8 required for Arrow write API")
-def spatialite_available(path):
- try:
- _ = read_dataframe(
- path, sql="select spatialite_version();", sql_dialect="SQLITE"
- )
- return True
- except Exception:
- return False
+@contextlib.contextmanager
+def use_arrow_context():
+ original = os.environ.get("PYOGRIO_USE_ARROW", None)
+ os.environ["PYOGRIO_USE_ARROW"] = "1"
+ yield
+ if original:
+ os.environ["PYOGRIO_USE_ARROW"] = original
+ else:
+ del os.environ["PYOGRIO_USE_ARROW"]
+
+
+def test_spatialite_available(test_gpkg_nulls):
+ """Check if SpatiaLite is available by running a simple SQL query."""
+ _ = read_dataframe(
+ test_gpkg_nulls, sql="select spatialite_version();", sql_dialect="SQLITE"
+ )
@pytest.mark.parametrize(
@@ -259,10 +271,6 @@ def test_read_force_2d(tmp_path, use_arrow):
assert not df.iloc[0].geometry.has_z
-@pytest.mark.skipif(
- not GDAL_GE_352,
- reason="gdal >= 3.5.2 needed to use OGR_GEOJSON_MAX_OBJ_SIZE with a float value",
-)
def test_read_geojson_error(naturalearth_lowres_geojson, use_arrow):
try:
set_gdal_config_options({"OGR_GEOJSON_MAX_OBJ_SIZE": 0.01})
@@ -275,6 +283,22 @@ def test_read_geojson_error(naturalearth_lowres_geojson, use_arrow):
set_gdal_config_options({"OGR_GEOJSON_MAX_OBJ_SIZE": None})
+@pytest.mark.skipif(
+ "LIBKML" not in list_drivers(),
+ reason="LIBKML driver is not available and is needed to read simpledata element",
+)
+def test_read_kml_simpledata(kml_file, use_arrow):
+ """Test reading a KML file with a simpledata element.
+
+ Simpledata elements are only read by the LibKML driver, not the KML driver.
+ """
+ gdf = read_dataframe(kml_file, use_arrow=use_arrow)
+
+ # Check if the simpledata column is present.
+ assert "formation" in gdf.columns
+ assert gdf["formation"].iloc[0] == "Ton"
+
+
def test_read_layer(tmp_path, use_arrow):
filename = tmp_path / "test.gpkg"
@@ -333,29 +357,162 @@ def test_read_datetime(datetime_file, use_arrow):
assert df.col.dtype.name == "datetime64[ns]"
-@pytest.mark.filterwarnings("ignore: Non-conformant content for record 1 in column ")
-@pytest.mark.requires_arrow_write_api
-def test_read_datetime_tz(datetime_tz_file, tmp_path, use_arrow):
- df = read_dataframe(datetime_tz_file)
- # Make the index non-consecutive to test this case as well. Added for issue
- # https://github.com/geopandas/pyogrio/issues/324
- df = df.set_index(np.array([0, 2]))
- raw_expected = ["2020-01-01T09:00:00.123-05:00", "2020-01-01T10:00:00-05:00"]
+def test_read_list_types(list_field_values_files, use_arrow):
+ """Test reading a geojson file containing fields with lists."""
+ if list_field_values_files.suffix == ".parquet" and not GDAL_HAS_PARQUET_DRIVER:
+ pytest.skip(
+ "Skipping test for parquet as the GDAL Parquet driver is not available"
+ )
- if PANDAS_GE_20:
- expected = pd.to_datetime(raw_expected, format="ISO8601").as_unit("ms")
+ info = read_info(list_field_values_files)
+ suffix = list_field_values_files.suffix
+
+ result = read_dataframe(list_field_values_files, use_arrow=use_arrow)
+
+ # Check list_int column
+ assert "list_int" in result.columns
+ assert info["fields"][1] == "list_int"
+ assert info["ogr_types"][1] in ("OFTIntegerList", "OFTInteger64List")
+ assert result["list_int"][0].tolist() == [0, 1]
+ assert result["list_int"][1].tolist() == [2, 3]
+ assert result["list_int"][2].tolist() == []
+ assert result["list_int"][3] is None
+ assert result["list_int"][4] is None
+
+ # Check list_double column
+ assert "list_double" in result.columns
+ assert info["fields"][2] == "list_double"
+ assert info["ogr_types"][2] == "OFTRealList"
+ assert result["list_double"][0].tolist() == [0.0, 1.0]
+ assert result["list_double"][1].tolist() == [2.0, 3.0]
+ assert result["list_double"][2].tolist() == []
+ assert result["list_double"][3] is None
+ assert result["list_double"][4] is None
+
+ # Check list_string column
+ assert "list_string" in result.columns
+ assert info["fields"][3] == "list_string"
+ assert info["ogr_types"][3] == "OFTStringList"
+ assert result["list_string"][0].tolist() == ["string1", "string2"]
+ assert result["list_string"][1].tolist() == ["string3", "string4", ""]
+ assert result["list_string"][2].tolist() == []
+ assert result["list_string"][3] is None
+ assert result["list_string"][4] == [""]
+
+ # Check list_int_with_null column
+ if suffix == ".geojson":
+ # Once any row of a column contains a null value in a list, the column isn't
+ # recognized as a list column anymore for .geojson files, but as a JSON column.
+ # Because JSON columns containing JSON Arrays are also parsed to python lists,
+ # the end result is the same...
+ exp_type = "OFTString"
+ exp_subtype = "OFSTJSON"
+ exp_list_int_with_null_value = [0, None]
else:
- expected = pd.to_datetime(raw_expected)
- expected = pd.Series(expected, name="datetime_col")
- assert_series_equal(df.datetime_col, expected, check_index=False)
- # test write and read round trips
- fpath = tmp_path / "test.gpkg"
- write_dataframe(df, fpath, use_arrow=use_arrow)
- df_read = read_dataframe(fpath, use_arrow=use_arrow)
- if use_arrow:
- # with Arrow, the datetimes are always read as UTC
- expected = expected.dt.tz_convert("UTC")
- assert_series_equal(df_read.datetime_col, expected)
+ # For .parquet files, the list column is preserved as a list column.
+ exp_type = "OFTInteger64List"
+ exp_subtype = "OFSTNone"
+ if use_arrow:
+ exp_list_int_with_null_value = [0.0, np.nan]
+ else:
+ exp_list_int_with_null_value = [0, 0]
+ # xfail: when reading a list of int with None values without Arrow from a
+ # .parquet file, the None values become 0, which is wrong.
+ # https://github.com/OSGeo/gdal/issues/13448
+
+ assert "list_int_with_null" in result.columns
+ assert info["fields"][4] == "list_int_with_null"
+ assert info["ogr_types"][4] == exp_type
+ assert info["ogr_subtypes"][4] == exp_subtype
+ assert result["list_int_with_null"][0][0] == 0
+ if exp_list_int_with_null_value[1] == 0:
+ assert result["list_int_with_null"][0][1] == exp_list_int_with_null_value[1]
+ else:
+ assert pd.isna(result["list_int_with_null"][0][1])
+
+ if suffix == ".geojson":
+ # For .geojson, the lists are already python lists
+ assert result["list_int_with_null"][1] == [2, 3]
+ assert result["list_int_with_null"][2] == []
+ else:
+ # For .parquet, the lists are numpy arrays
+ assert result["list_int_with_null"][1].tolist() == [2, 3]
+ assert result["list_int_with_null"][2].tolist() == []
+
+ assert pd.isna(result["list_int_with_null"][3])
+ assert pd.isna(result["list_int_with_null"][4])
+
+ # Check list_string_with_null column
+ if suffix == ".geojson":
+ # Once any row of a column contains a null value in a list, the column isn't
+ # recognized as a list column anymore for .geojson files, but as a JSON column.
+ # Because JSON columns containing JSON Arrays are also parsed to python lists,
+ # the end result is the same...
+ exp_type = "OFTString"
+ exp_subtype = "OFSTJSON"
+ else:
+ # For .parquet files, the list column is preserved as a list column.
+ exp_type = "OFTStringList"
+ exp_subtype = "OFSTNone"
+
+ assert "list_string_with_null" in result.columns
+ assert info["fields"][5] == "list_string_with_null"
+ assert info["ogr_types"][5] == exp_type
+ assert info["ogr_subtypes"][5] == exp_subtype
+
+ if suffix == ".geojson":
+ # For .geojson, the lists are already python lists
+ assert result["list_string_with_null"][0] == ["string1", None]
+ assert result["list_string_with_null"][1] == ["string3", "string4", ""]
+ assert result["list_string_with_null"][2] == []
+ else:
+ # For .parquet, the lists are numpy arrays
+ # When use_arrow=False, the None becomes an empty string, which is wrong.
+ exp_value = ["string1", ""] if not use_arrow else ["string1", None]
+ assert result["list_string_with_null"][0].tolist() == exp_value
+ assert result["list_string_with_null"][1].tolist() == ["string3", "string4", ""]
+ assert result["list_string_with_null"][2].tolist() == []
+
+ assert pd.isna(result["list_string_with_null"][3])
+ assert result["list_string_with_null"][4] == [""]
+
+
+@pytest.mark.requires_arrow_write_api
+@pytest.mark.skipif(
+ not GDAL_HAS_PARQUET_DRIVER, reason="Parquet driver is not available"
+)
+def test_read_list_nested_struct_parquet_file(
+ list_nested_struct_parquet_file, use_arrow
+):
+ """Test reading a Parquet file containing nested struct and list types."""
+ if not use_arrow:
+ pytest.skip(
+ "When use_arrow=False, gdal flattens nested columns to seperate columns. "
+ "Not sure how we want to deal with this case, but for now just skip."
+ )
+
+ result = read_dataframe(list_nested_struct_parquet_file, use_arrow=use_arrow)
+
+ assert "col_flat" in result.columns
+ assert np.array_equal(result["col_flat"].to_numpy(), np.array([0, 1, 2]))
+
+ assert "col_list" in result.columns
+ assert result["col_list"].dtype == object
+ assert result["col_list"][0].tolist() == [1, 2, 3]
+ assert result["col_list"][1].tolist() == [1, 2, 3]
+ assert result["col_list"][2].tolist() == [1, 2, 3]
+
+ assert "col_nested" in result.columns
+ assert result["col_nested"].dtype == object
+ assert result["col_nested"][0].tolist() == [{"a": 1, "b": 2}, {"a": 1, "b": 2}]
+ assert result["col_nested"][1].tolist() == [{"a": 1, "b": 2}, {"a": 1, "b": 2}]
+ assert result["col_nested"][2].tolist() == [{"a": 1, "b": 2}, {"a": 1, "b": 2}]
+
+ assert "col_struct" in result.columns
+ assert result["col_struct"].dtype == object
+ assert result["col_struct"][0] == {"a": 1, "b": 2}
+ assert result["col_struct"][1] == {"a": 1, "b": 2}
+ assert result["col_struct"][2] == {"a": 1, "b": 2}
@pytest.mark.filterwarnings(
@@ -371,39 +528,511 @@ def test_write_datetime_mixed_offset(tmp_path, use_arrow):
if PANDAS_GE_20:
utc_col = utc_col.dt.as_unit("ms")
+
+ at pytest.mark.parametrize("datetime_as_string", [False, True])
+ at pytest.mark.parametrize("mixed_offsets_as_utc", [False, True])
+def test_read_datetime_long_ago(
+ geojson_datetime_long_ago, use_arrow, mixed_offsets_as_utc, datetime_as_string
+):
+ """Test writing/reading a column with a datetime far in the past.
+ Dates from before 1678-1-1 aren't parsed correctly by pandas < 3.0, so they
+ stay strings.
+ Reported in https://github.com/geopandas/pyogrio/issues/553.
+ """
+ handler = contextlib.nullcontext()
+ overflow_occurred = False
+ if not datetime_as_string and not PANDAS_GE_30 and (not use_arrow or GDAL_GE_311):
+ # When datetimes should not be returned as string and arrow is not used or
+ # arrow is used with GDAL >= 3.11, `pandas.to_datetime` is used to parse the
+ # datetimes. However, when using pandas < 3.0, this raises an
+ # "Out of bounds nanosecond timestamp" error for very old dates.
+ # As a result, `read_dataframe` gives a warning and the datetimes stay strings.
+ handler = pytest.warns(
+ UserWarning, match="Error parsing datetimes, original strings are returned"
+ )
+ overflow_occurred = True
+ # XFAIL: datetimes before 1678-1-1 give overflow with arrow=False and pandas<3.0
+ elif use_arrow and not PANDAS_GE_20 and not GDAL_GE_311:
+ # When arrow is used with pandas < 2.0 and GDAL < 3.11, an overflow occurs in
+ # pyarrow.to_pandas().
+ handler = pytest.raises(
+ Exception,
+ match=re.escape("Casting from timestamp[ms] to timestamp[ns] would result"),
+ )
+ overflow_occurred = True
+ # XFAIL: datetimes before 1678-1-1 give overflow with arrow=True and pandas<2.0
+
+ with handler:
+ df = read_dataframe(
+ geojson_datetime_long_ago,
+ use_arrow=use_arrow,
+ datetime_as_string=datetime_as_string,
+ mixed_offsets_as_utc=mixed_offsets_as_utc,
+ )
+
+ exp_dates_str = pd.Series(["1670-01-01T09:00:00"], name="datetime_col")
+ if datetime_as_string:
+ assert is_string_dtype(df.datetime_col.dtype)
+ assert_series_equal(df.datetime_col, exp_dates_str)
+ else:
+ # It is a single naive datetime, so regardless of mixed_offsets_as_utc the
+ # expected "ideal" result is the same: a datetime64 without time zone info.
+ if overflow_occurred:
+ # Strings are returned because of an overflow.
+ assert is_string_dtype(df.datetime_col.dtype)
+ assert_series_equal(df.datetime_col, exp_dates_str)
+ else:
+ # With use_arrow or pandas >= 3.0, old datetimes are parsed correctly.
+ assert is_datetime64_dtype(df.datetime_col)
+ assert df.datetime_col.iloc[0] == pd.Timestamp(1670, 1, 1, 9, 0, 0)
+ assert df.datetime_col.iloc[0].unit == "ms"
+
+
+ at pytest.mark.parametrize("ext", [ext for ext in ALL_EXTS if ext != ".shp"])
+ at pytest.mark.parametrize("datetime_as_string", [False, True])
+ at pytest.mark.parametrize("mixed_offsets_as_utc", [False, True])
+ at pytest.mark.requires_arrow_write_api
+def test_write_read_datetime_no_tz(
+ tmp_path, ext, datetime_as_string, mixed_offsets_as_utc, use_arrow
+):
+ """Test writing/reading a column with naive datetimes (no time zone information)."""
+ dates_raw = ["2020-01-01T09:00:00.123", "2020-01-01T10:00:00", np.nan]
+ if PANDAS_GE_20:
+ dates = pd.to_datetime(dates_raw, format="ISO8601").as_unit("ms")
+ else:
+ dates = pd.to_datetime(dates_raw)
df = gp.GeoDataFrame(
- {"dates": localised_col, "geometry": [Point(1, 1), Point(1, 1)]},
+ {"dates": dates, "geometry": [Point(1, 1)] * 3}, crs="EPSG:4326"
+ )
+
+ fpath = tmp_path / f"test{ext}"
+ write_dataframe(df, fpath, use_arrow=use_arrow)
+ result = read_dataframe(
+ fpath,
+ use_arrow=use_arrow,
+ datetime_as_string=datetime_as_string,
+ mixed_offsets_as_utc=mixed_offsets_as_utc,
+ )
+
+ if use_arrow and ext == ".gpkg" and __gdal_version__ < (3, 11, 0):
+ # With GDAL < 3.11 with arrow, columns with naive datetimes are written
+ # correctly, but when read they are wrongly interpreted as being in UTC.
+ # The reason is complicated, but more info can be found e.g. here:
+ # https://github.com/geopandas/pyogrio/issues/487#issuecomment-2423762807
+ exp_dates = df.dates.dt.tz_localize("UTC")
+ if datetime_as_string:
+ exp_dates = exp_dates.astype("str").str.replace(" ", "T")
+ exp_dates[2] = np.nan
+ assert_series_equal(result.dates, exp_dates)
+ elif not mixed_offsets_as_utc:
+ assert_series_equal(result.dates, exp_dates)
+ # XFAIL: naive datetimes read wrong in GPKG with GDAL < 3.11 via arrow
+
+ elif datetime_as_string:
+ assert is_string_dtype(result.dates.dtype)
+ if use_arrow and __gdal_version__ < (3, 11, 0):
+ dates_str = df.dates.astype("str").str.replace(" ", "T")
+ dates_str[2] = np.nan
+ else:
+ dates_str = pd.Series(dates_raw, name="dates")
+ assert_series_equal(result.dates, dates_str)
+ else:
+ assert is_datetime64_dtype(result.dates.dtype)
+ assert_geodataframe_equal(result, df)
+
+
+ at pytest.mark.parametrize("ext", [ext for ext in ALL_EXTS if ext != ".shp"])
+ at pytest.mark.parametrize("datetime_as_string", [False, True])
+ at pytest.mark.parametrize("mixed_offsets_as_utc", [False, True])
+ at pytest.mark.filterwarnings("ignore: Non-conformant content for record 1 in column ")
+ at pytest.mark.requires_arrow_write_api
+def test_write_read_datetime_tz(
+ request, tmp_path, ext, datetime_as_string, mixed_offsets_as_utc, use_arrow
+):
+ """Write and read file with all equal time zones.
+
+ This should result in the result being in pandas datetime64 dtype column.
+ """
+ if use_arrow and __gdal_version__ < (3, 10, 0) and ext in (".geojson", ".geojsonl"):
+ # With GDAL < 3.10 with arrow, the time zone offset was applied to the datetime
+ # as well as retaining the time zone.
+ # This was fixed in https://github.com/OSGeo/gdal/pull/11049
+ request.node.add_marker(
+ pytest.mark.xfail(
+ reason="Wrong datetimes read in GeoJSON with GDAL < 3.10 via arrow"
+ )
+ )
+
+ dates_raw = ["2020-01-01T09:00:00.123-05:00", "2020-01-01T10:00:00-05:00", np.nan]
+ if PANDAS_GE_20:
+ dates = pd.to_datetime(dates_raw, format="ISO8601").as_unit("ms")
+ else:
+ dates = pd.to_datetime(dates_raw)
+
+ # Make the index non-consecutive to test this case as well. Added for issue
+ # https://github.com/geopandas/pyogrio/issues/324
+ df = gp.GeoDataFrame(
+ {"dates": dates, "geometry": [Point(1, 1)] * 3},
+ index=[0, 2, 3],
crs="EPSG:4326",
)
- fpath = tmp_path / "test.gpkg"
+ assert isinstance(df.dates.dtype, pd.DatetimeTZDtype)
+
+ fpath = tmp_path / f"test{ext}"
+ write_dataframe(df, fpath, use_arrow=use_arrow)
+ result = read_dataframe(
+ fpath,
+ use_arrow=use_arrow,
+ datetime_as_string=datetime_as_string,
+ mixed_offsets_as_utc=mixed_offsets_as_utc,
+ )
+
+ # With some older versions, the offset is represented slightly differently
+ if result.dates.dtype.name.endswith(", pytz.FixedOffset(-300)]"):
+ result.dates = result.dates.astype(df.dates.dtype)
+
+ if use_arrow and ext in (".fgb", ".gpkg") and __gdal_version__ < (3, 11, 0):
+ # With GDAL < 3.11 with arrow, datetime columns are written as string type
+ df_exp = df.copy()
+ df_exp.dates = df_exp[df_exp.dates.notna()].dates.astype(str)
+ assert_series_equal(result.dates, df_exp.dates, check_index=False)
+ # XFAIL: datetime columns written as string with GDAL < 3.11 via arrow
+ elif datetime_as_string:
+ assert is_string_dtype(result.dates.dtype)
+ if use_arrow and __gdal_version__ < (3, 11, 0):
+ dates_str = df.dates.astype("str").str.replace(" ", "T")
+ dates_str.iloc[2] = np.nan
+ elif __gdal_version__ < (3, 7, 0):
+ # With GDAL < 3.7, time zone minutes aren't included in the string
+ dates_str = [x[:-3] for x in dates_raw if pd.notna(x)] + [np.nan]
+ dates_str = pd.Series(dates_str, name="dates")
+ else:
+ dates_str = pd.Series(dates_raw, name="dates")
+ assert_series_equal(result.dates, dates_str, check_index=False)
+ else:
+ assert_series_equal(result.dates, df.dates, check_index=False)
+
+
+ at pytest.mark.parametrize("ext", [ext for ext in ALL_EXTS if ext != ".shp"])
+ at pytest.mark.parametrize("datetime_as_string", [False, True])
+ at pytest.mark.parametrize("mixed_offsets_as_utc", [False, True])
+ at pytest.mark.filterwarnings(
+ "ignore: Non-conformant content for record 1 in column dates"
+)
+@pytest.mark.requires_arrow_write_api
+def test_write_read_datetime_tz_localized_mixed_offset(
+ tmp_path, ext, datetime_as_string, mixed_offsets_as_utc, use_arrow
+):
+ """Test with localized dates across a different summer/winter time zone offset."""
+ # Australian Summer Time AEDT (GMT+11), Standard Time AEST (GMT+10)
+ dates_raw = ["2023-01-01 11:00:01.111", "2023-06-01 10:00:01.111", np.nan]
+ dates_naive = pd.Series(pd.to_datetime(dates_raw), name="dates")
+ dates_local = dates_naive.dt.tz_localize("Australia/Sydney")
+ dates_local_offsets_str = dates_local.astype(str)
+ if datetime_as_string:
+ exp_dates = dates_local_offsets_str.str.replace(" ", "T")
+ exp_dates = exp_dates.str.replace(".111000", ".111")
+ if __gdal_version__ < (3, 7, 0):
+ # With GDAL < 3.7, time zone minutes aren't included in the string
+ exp_dates = exp_dates.str.slice(0, -3)
+ elif mixed_offsets_as_utc:
+ exp_dates = dates_local.dt.tz_convert("UTC")
+ if PANDAS_GE_20:
+ exp_dates = exp_dates.dt.as_unit("ms")
+ else:
+ exp_dates = dates_local_offsets_str.apply(
+ lambda x: pd.Timestamp(x) if pd.notna(x) else None
+ )
+
+ df = gp.GeoDataFrame(
+ {"dates": dates_local, "geometry": [Point(1, 1)] * 3}, crs="EPSG:4326"
+ )
+ fpath = tmp_path / f"test{ext}"
+ write_dataframe(df, fpath, use_arrow=use_arrow)
+ result = read_dataframe(
+ fpath,
+ use_arrow=use_arrow,
+ datetime_as_string=datetime_as_string,
+ mixed_offsets_as_utc=mixed_offsets_as_utc,
+ )
+
+ if use_arrow and __gdal_version__ < (3, 11, 0):
+ if ext in (".geojson", ".geojsonl"):
+ # With GDAL < 3.11 with arrow, GDAL converts mixed time zone datetimes to
+ # UTC when read, as the arrow datetime column type does not support mixed tz.
+ dates_utc = dates_local.dt.tz_convert("UTC")
+ if PANDAS_GE_20:
+ dates_utc = dates_utc.dt.as_unit("ms")
+ if datetime_as_string:
+ assert is_string_dtype(result.dates.dtype)
+ dates_utc = dates_utc.astype(str).str.replace(" ", "T")
+ assert pd.isna(result.dates[2])
+ assert_series_equal(result.dates.head(2), dates_utc.head(2))
+ # XFAIL: mixed tz datetimes converted to UTC with GDAL < 3.11 + arrow
+ return
+
+ elif ext in (".gpkg", ".fgb"):
+ # With GDAL < 3.11 with arrow, datetime columns written as string type
+ assert pd.isna(result.dates[2])
+ assert_series_equal(result.dates.head(2), dates_local_offsets_str.head(2))
+ # XFAIL: datetime columns written as string with GDAL < 3.11 + arrow
+ return
+
+ # GDAL tz only encodes offsets, not time zones
+ if datetime_as_string:
+ assert is_string_dtype(result.dates.dtype)
+ elif mixed_offsets_as_utc:
+ assert isinstance(result.dates.dtype, pd.DatetimeTZDtype)
+ else:
+ assert is_object_dtype(result.dates.dtype)
+
+ # Check isna for the third value separately, as it differs between versions and
+ # pandas 3.0's assert_series_equal becomes strict about this.
+ assert pd.isna(result.dates[2])
+ assert_series_equal(result.dates.head(2), exp_dates.head(2))
+
+
+ at pytest.mark.parametrize("ext", [ext for ext in ALL_EXTS if ext != ".shp"])
+ at pytest.mark.parametrize("datetime_as_string", [False, True])
+ at pytest.mark.parametrize("mixed_offsets_as_utc", [False, True])
+ at pytest.mark.filterwarnings(
+ "ignore: Non-conformant content for record 1 in column dates"
+)
+@pytest.mark.requires_arrow_write_api
+def test_write_read_datetime_tz_mixed_offsets(
+ tmp_path, ext, datetime_as_string, mixed_offsets_as_utc, use_arrow
+):
+ """Test with dates with mixed time zone offsets."""
+ # Pandas datetime64 column types doesn't support mixed time zone offsets, so
+ # it needs to be a list of pandas.Timestamp objects instead.
+ dates = [
+ pd.Timestamp("2023-01-01 11:00:01.111+01:00"),
+ pd.Timestamp("2023-06-01 10:00:01.111+05:00"),
+ np.nan,
+ ]
+
+ df = gp.GeoDataFrame(
+ {"dates": dates, "geometry": [Point(1, 1)] * 3}, crs="EPSG:4326"
+ )
+ fpath = tmp_path / f"test{ext}"
write_dataframe(df, fpath, use_arrow=use_arrow)
- result = read_dataframe(fpath, use_arrow=use_arrow)
- # GDAL tz only encodes offsets, not timezones
- # check multiple offsets are read as utc datetime instead of string values
- assert_series_equal(result["dates"], utc_col)
+ result = read_dataframe(
+ fpath,
+ use_arrow=use_arrow,
+ datetime_as_string=datetime_as_string,
+ mixed_offsets_as_utc=mixed_offsets_as_utc,
+ )
+
+ if use_arrow and __gdal_version__ < (3, 11, 0):
+ if ext in (".geojson", ".geojsonl"):
+ # With GDAL < 3.11 with arrow, GDAL converts mixed time zone datetimes to
+ # UTC when read, as the arrow datetime column type does not support mixed tz.
+ df_exp = df.copy()
+ df_exp.dates = pd.to_datetime(dates, utc=True)
+ if PANDAS_GE_20:
+ df_exp.dates = df_exp.dates.dt.as_unit("ms")
+ if datetime_as_string:
+ df_exp.dates = df_exp.dates.astype("str").str.replace(" ", "T")
+ df_exp.loc[2, "dates"] = pd.NA
+ assert_geodataframe_equal(result, df_exp)
+ # XFAIL: mixed tz datetimes converted to UTC with GDAL < 3.11 + arrow
+ return
+
+ elif ext in (".gpkg", ".fgb"):
+ # With arrow and GDAL < 3.11, mixed time zone datetimes are written as
+ # string type columns, so no proper roundtrip possible.
+ df_exp = df.copy()
+ df_exp.dates = df_exp.dates.astype("string").astype("O")
+ assert_geodataframe_equal(result, df_exp)
+ # XFAIL: datetime columns written as string with GDAL < 3.11 + arrow
+ return
+
+ if datetime_as_string:
+ assert is_string_dtype(result.dates.dtype)
+ dates_str = df.dates.map(
+ lambda x: x.isoformat(timespec="milliseconds") if pd.notna(x) else np.nan
+ )
+ if __gdal_version__ < (3, 7, 0):
+ # With GDAL < 3.7, time zone minutes aren't included in the string
+ dates_str = dates_str.str.slice(0, -3)
+ assert_series_equal(result.dates, dates_str)
+ elif mixed_offsets_as_utc:
+ assert isinstance(result.dates.dtype, pd.DatetimeTZDtype)
+ exp_dates = pd.to_datetime(df.dates, utc=True)
+ if PANDAS_GE_20:
+ exp_dates = exp_dates.dt.as_unit("ms")
+ assert_series_equal(result.dates, exp_dates)
+ else:
+ assert is_object_dtype(result.dates.dtype)
+ assert_geodataframe_equal(result, df)
+ at pytest.mark.parametrize("ext", [ext for ext in ALL_EXTS if ext != ".shp"])
+ at pytest.mark.parametrize(
+ "dates_raw",
+ [
+ (
+ pd.Timestamp("2020-01-01T09:00:00.123-05:00"),
+ pd.Timestamp("2020-01-01T10:00:00-05:00"),
+ np.nan,
+ ),
+ (
+ datetime.fromisoformat("2020-01-01T09:00:00.123-05:00"),
+ datetime.fromisoformat("2020-01-01T10:00:00-05:00"),
+ np.nan,
+ ),
+ ],
+)
+ at pytest.mark.parametrize("datetime_as_string", [False, True])
+ at pytest.mark.parametrize("mixed_offsets_as_utc", [False, True])
@pytest.mark.filterwarnings(
"ignore: Non-conformant content for record 1 in column dates"
)
@pytest.mark.requires_arrow_write_api
-def test_read_write_datetime_tz_with_nulls(tmp_path, use_arrow):
- dates_raw = ["2020-01-01T09:00:00.123-05:00", "2020-01-01T10:00:00-05:00", pd.NaT]
+def test_write_read_datetime_tz_objects(
+ tmp_path, dates_raw, ext, use_arrow, datetime_as_string, mixed_offsets_as_utc
+):
+ """Datetime objects with equal offsets are read as datetime64."""
+ dates = pd.Series(dates_raw, dtype="O")
+ df = gp.GeoDataFrame(
+ {"dates": dates, "geometry": [Point(1, 1)] * 3}, crs="EPSG:4326"
+ )
+
+ fpath = tmp_path / f"test{ext}"
+ write_dataframe(df, fpath, use_arrow=use_arrow)
+ result = read_dataframe(
+ fpath,
+ use_arrow=use_arrow,
+ datetime_as_string=datetime_as_string,
+ mixed_offsets_as_utc=mixed_offsets_as_utc,
+ )
+
+ # Check result
+ if PANDAS_GE_20:
+ exp_dates = pd.to_datetime(dates_raw, format="ISO8601").as_unit("ms")
+ else:
+ exp_dates = pd.to_datetime(dates_raw)
+ exp_df = df.copy()
+ exp_df["dates"] = pd.Series(exp_dates, name="dates")
+
+ # With some older versions, the offset is represented slightly differently
+ if result.dates.dtype.name.endswith(", pytz.FixedOffset(-300)]"):
+ result["dates"] = result.dates.astype(exp_df.dates.dtype)
+
+ if use_arrow and __gdal_version__ < (3, 10, 0) and ext in (".geojson", ".geojsonl"):
+ # XFAIL: Wrong datetimes read in GeoJSON with GDAL < 3.10 via arrow.
+ # The time zone offset was applied to the datetime as well as retaining
+ # the time zone. This was fixed in https://github.com/OSGeo/gdal/pull/11049
+
+ # Subtract 5 hours from the expected datetimes to match the wrong result.
+ if datetime_as_string:
+ exp_df["dates"] = pd.Series(
+ [
+ "2020-01-01T04:00:00.123000-05:00",
+ "2020-01-01T05:00:00-05:00",
+ np.nan,
+ ]
+ )
+ else:
+ exp_df["dates"] = exp_df.dates - pd.Timedelta(hours=5)
+ if PANDAS_GE_20:
+ # The unit needs to be applied again apparently
+ exp_df["dates"] = exp_df.dates.dt.as_unit("ms")
+ assert_geodataframe_equal(result, exp_df)
+ return
+
+ if use_arrow and __gdal_version__ < (3, 11, 0) and ext in (".fgb", ".gpkg"):
+ # XFAIL: datetime columns are written as string with GDAL < 3.11 + arrow
+ # -> custom formatting because the df column is object dtype and thus
+ # astype(str) converted the datetime objects one by one
+ exp_df["dates"] = pd.Series(
+ ["2020-01-01 09:00:00.123000-05:00", "2020-01-01 10:00:00-05:00", np.nan]
+ )
+ assert_geodataframe_equal(result, exp_df)
+ return
+
+ if datetime_as_string:
+ assert is_string_dtype(result.dates.dtype)
+ if use_arrow and __gdal_version__ < (3, 11, 0):
+ # With GDAL < 3.11 with arrow, datetime columns are written as string type
+ exp_df["dates"] = pd.Series(
+ [
+ "2020-01-01T09:00:00.123000-05:00",
+ "2020-01-01T10:00:00-05:00",
+ np.nan,
+ ]
+ )
+ else:
+ exp_df["dates"] = pd.Series(
+ ["2020-01-01T09:00:00.123-05:00", "2020-01-01T10:00:00-05:00", np.nan]
+ )
+ if __gdal_version__ < (3, 7, 0):
+ # With GDAL < 3.7, time zone minutes aren't included in the string
+ exp_df["dates"] = exp_df.dates.str.slice(0, -3)
+ elif mixed_offsets_as_utc:
+ # the offsets are all -05:00, so the result retains the offset and not UTC
+ assert isinstance(result.dates.dtype, pd.DatetimeTZDtype)
+ assert str(result.dates.dtype.tz) in ("UTC-05:00", "pytz.FixedOffset(-300)")
+ else:
+ assert isinstance(result.dates.dtype, pd.DatetimeTZDtype)
+
+ assert_geodataframe_equal(result, exp_df)
+
+
+ at pytest.mark.parametrize("ext", [ext for ext in ALL_EXTS if ext != ".shp"])
+ at pytest.mark.parametrize("datetime_as_string", [False, True])
+ at pytest.mark.parametrize("mixed_offsets_as_utc", [False, True])
+ at pytest.mark.requires_arrow_write_api
+def test_write_read_datetime_utc(
+ tmp_path, ext, use_arrow, datetime_as_string, mixed_offsets_as_utc
+):
+ """Test writing/reading a column with UTC datetimes."""
+ dates_raw = ["2020-01-01T09:00:00.123Z", "2020-01-01T10:00:00Z", np.nan]
if PANDAS_GE_20:
dates = pd.to_datetime(dates_raw, format="ISO8601").as_unit("ms")
else:
dates = pd.to_datetime(dates_raw)
df = gp.GeoDataFrame(
- {"dates": dates, "geometry": [Point(1, 1), Point(1, 1), Point(1, 1)]},
- crs="EPSG:4326",
+ {"dates": dates, "geometry": [Point(1, 1)] * 3}, crs="EPSG:4326"
)
- fpath = tmp_path / "test.gpkg"
+ assert df.dates.dtype.name in ("datetime64[ms, UTC]", "datetime64[ns, UTC]")
+
+ fpath = tmp_path / f"test{ext}"
write_dataframe(df, fpath, use_arrow=use_arrow)
- result = read_dataframe(fpath, use_arrow=use_arrow)
- if use_arrow:
- # with Arrow, the datetimes are always read as UTC
- df["dates"] = df["dates"].dt.tz_convert("UTC")
- assert_geodataframe_equal(df, result)
+ result = read_dataframe(
+ fpath,
+ use_arrow=use_arrow,
+ datetime_as_string=datetime_as_string,
+ mixed_offsets_as_utc=mixed_offsets_as_utc,
+ )
+
+ if use_arrow and ext == ".fgb" and __gdal_version__ < (3, 11, 0):
+ # With GDAL < 3.11 with arrow, time zone information is dropped when reading
+ # .fgb
+ if datetime_as_string:
+ assert is_string_dtype(result.dates.dtype)
+ dates_str = pd.Series(
+ ["2020-01-01T09:00:00.123", "2020-01-01T10:00:00.000", np.nan],
+ name="dates",
+ )
+ assert_series_equal(result.dates, dates_str)
+ else:
+ assert_series_equal(result.dates, df.dates.dt.tz_localize(None))
+ # XFAIL: UTC datetimes read wrong in .fgb with GDAL < 3.11 via arrow
+ elif datetime_as_string:
+ assert is_string_dtype(result.dates.dtype)
+ if use_arrow and __gdal_version__ < (3, 11, 0):
+ dates_str = df.dates.astype("str").str.replace(" ", "T")
+ dates_str[2] = np.nan
+ else:
+ dates_str = pd.Series(dates_raw, name="dates")
+ if __gdal_version__ < (3, 7, 0):
+ # With GDAL < 3.7, datetime ends with +00 for UTC, not Z
+ dates_str = dates_str.str.replace("Z", "+00")
+ assert_series_equal(result.dates, dates_str)
+ else:
+ assert result.dates.dtype.name in ("datetime64[ms, UTC]", "datetime64[ns, UTC]")
+ assert_geodataframe_equal(result, df)
def test_read_null_values(tmp_path, use_arrow):
@@ -1291,7 +1920,9 @@ def test_write_None_string_column(tmp_path, use_arrow):
assert filename.exists()
result_gdf = read_dataframe(filename, use_arrow=use_arrow)
- if PANDAS_GE_30 and use_arrow:
+ if (
+ PANDAS_GE_30 or (PANDAS_GE_23 and pd.options.future.infer_string)
+ ) and use_arrow:
assert result_gdf.object_col.dtype == "str"
gdf["object_col"] = gdf["object_col"].astype("str")
else:
@@ -1349,12 +1980,6 @@ def test_write_dataframe_gpkg_multiple_layers(tmp_path, naturalearth_lowres, use
@pytest.mark.parametrize("ext", ALL_EXTS)
@pytest.mark.requires_arrow_write_api
def test_write_dataframe_append(request, tmp_path, naturalearth_lowres, ext, use_arrow):
- if ext == ".fgb" and __gdal_version__ <= (3, 5, 0):
- pytest.skip("Append to FlatGeobuf fails for GDAL <= 3.5.0")
-
- if ext in (".geojsonl", ".geojsons") and __gdal_version__ <= (3, 6, 0):
- pytest.skip("Append to GeoJSONSeq only available for GDAL >= 3.6.0")
-
if use_arrow and ext.startswith(".geojson"):
# Bug in GDAL when appending int64 to GeoJSON
# (https://github.com/OSGeo/gdal/issues/9792)
@@ -2005,7 +2630,7 @@ def test_read_dataset_kwargs(nested_geojson_file, use_arrow):
expected = gp.GeoDataFrame(
{
"top_level": ["A"],
- "intermediate_level": ['{ "bottom_level": "B" }'],
+ "intermediate_level": [{"bottom_level": "B"}],
},
geometry=[shapely.Point(0, 0)],
crs="EPSG:4326",
@@ -2198,6 +2823,29 @@ def test_arrow_bool_exception(tmp_path, ext):
_ = read_dataframe(filename, use_arrow=True)
+@requires_pyarrow_api
+def test_arrow_enable_with_environment_variable(tmp_path):
+ """Test if arrow can be enabled via an environment variable."""
+ # Latin 1 / Western European
+ encoding = "CP1252"
+ text = "ÿ"
+ test_path = tmp_path / "test.gpkg"
+
+ df = gp.GeoDataFrame({text: [text], "geometry": [Point(0, 0)]}, crs="EPSG:4326")
+ write_dataframe(df, test_path, encoding=encoding)
+
+ # Without arrow, specifying the encoding is supported
+ result = read_dataframe(test_path, encoding="cp1252")
+ assert result is not None
+
+ # With arrow enabled, specifying the encoding is not supported
+ with use_arrow_context():
+ with pytest.raises(
+ ValueError, match="non-UTF-8 encoding is not supported for Arrow"
+ ):
+ _ = read_dataframe(test_path, encoding="cp1252")
+
+
@pytest.mark.filterwarnings("ignore:File /vsimem:RuntimeWarning")
@pytest.mark.parametrize("driver", ["GeoJSON", "GPKG"])
def test_write_memory(naturalearth_lowres, driver):
@@ -2242,9 +2890,6 @@ def test_write_memory_driver_required(naturalearth_lowres):
@pytest.mark.parametrize("driver", ["ESRI Shapefile", "OpenFileGDB"])
def test_write_memory_unsupported_driver(naturalearth_lowres, driver):
- if driver == "OpenFileGDB" and __gdal_version__ < (3, 6, 0):
- pytest.skip("OpenFileGDB write support only available for GDAL >= 3.6.0")
-
df = read_dataframe(naturalearth_lowres)
buffer = BytesIO()
@@ -2479,27 +3124,43 @@ def test_write_kml_file_coordinate_order(tmp_path, use_arrow):
assert np.array_equal(gdf_in.geometry.values, points)
- if "LIBKML" in list_drivers():
- # test appending to the existing file only if LIBKML is available
- # as it appears to fall back on LIBKML driver when appending.
- points_append = [Point(7, 8), Point(9, 10), Point(11, 12)]
- gdf_append = gp.GeoDataFrame(geometry=points_append, crs="EPSG:4326")
- write_dataframe(
- gdf_append,
- output_path,
- layer="tmp_layer",
- driver="KML",
- use_arrow=use_arrow,
- append=True,
- )
- # force_2d used to only compare xy geometry as z-dimension is undesirably
- # introduced when the kml file is over-written.
- gdf_in_appended = read_dataframe(
- output_path, use_arrow=use_arrow, force_2d=True
- )
+@pytest.mark.requires_arrow_write_api
+@pytest.mark.skipif(
+ "LIBKML" not in list_drivers(),
+ reason="LIBKML driver is not available and is needed to append to .kml",
+)
+def test_write_kml_append(tmp_path, use_arrow):
+ """Append features to an existing KML file.
- assert np.array_equal(gdf_in_appended.geometry.values, points + points_append)
+ Appending is only supported by the LIBKML driver, and the driver isn't
+ included in the GDAL ubuntu-small images, so skip if not available.
+ """
+ points = [Point(10, 20), Point(30, 40), Point(50, 60)]
+ gdf = gp.GeoDataFrame(geometry=points, crs="EPSG:4326")
+ output_path = tmp_path / "test.kml"
+ write_dataframe(
+ gdf, output_path, layer="tmp_layer", driver="KML", use_arrow=use_arrow
+ )
+
+ # test appending to the existing file only if LIBKML is available
+ # as it appears to fall back on LIBKML driver when appending.
+ points_append = [Point(7, 8), Point(9, 10), Point(11, 12)]
+ gdf_append = gp.GeoDataFrame(geometry=points_append, crs="EPSG:4326")
+
+ write_dataframe(
+ gdf_append,
+ output_path,
+ layer="tmp_layer",
+ driver="KML",
+ use_arrow=use_arrow,
+ append=True,
+ )
+ # force_2d is used to only compare the xy dimensions of the geometry, as the LIBKML
+ # driver always adds the z-dimension when the kml file is over-written.
+ gdf_in_appended = read_dataframe(output_path, use_arrow=use_arrow, force_2d=True)
+
+ assert np.array_equal(gdf_in_appended.geometry.values, points + points_append)
@pytest.mark.requires_arrow_write_api
@@ -2537,3 +3198,21 @@ def test_write_geojson_rfc7946_coordinates(tmp_path, use_arrow):
gdf_in_appended = read_dataframe(output_path, use_arrow=use_arrow)
assert np.array_equal(gdf_in_appended.geometry.values, points + points_append)
+
+
+@pytest.mark.requires_arrow_api
+@pytest.mark.skipif(
+ not GDAL_HAS_PARQUET_DRIVER, reason="Parquet driver is not available"
+)
+def test_parquet_driver(tmp_path, use_arrow):
+ """
+ Simple test verifying the Parquet driver works if available
+ """
+ gdf = gp.GeoDataFrame(
+ {"col": [1, 2, 3], "geometry": [Point(0, 0), Point(1, 1), Point(2, 2)]},
+ crs="EPSG:4326",
+ )
+ output_path = tmp_path / "test.parquet"
+ write_dataframe(gdf, output_path, use_arrow=use_arrow)
+ result = read_dataframe(output_path, use_arrow=use_arrow)
+ assert_geodataframe_equal(result, gdf)
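
For context on the new keywords exercised throughout these tests, a minimal sketch of how they are used from the public API; the file name "example.gpkg" and its column layout are illustrative only.

    from pyogrio import read_dataframe

    # Keep datetime columns as the strings returned by GDAL (ISO 8601 with
    # GDAL >= 3.7) instead of parsing them into datetime64 columns.
    df_str = read_dataframe("example.gpkg", datetime_as_string=True)

    # Read a column holding mixed time zone offsets as a single UTC
    # datetime64 column instead of an object column of pandas.Timestamp values.
    df_utc = read_dataframe("example.gpkg", mixed_offsets_as_utc=True)
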
=====================================
pyogrio/tests/test_raw_io.py
=====================================
@@ -24,7 +24,6 @@ from pyogrio.tests.conftest import (
DRIVER_EXT,
DRIVERS,
prepare_testfile,
- requires_arrow_api,
requires_pyarrow_api,
requires_shapely,
)
@@ -616,10 +615,6 @@ def test_write_no_geom_no_fields():
write("test.gpkg", geometry=None, field_data=None, fields=None)
-@pytest.mark.skipif(
- __gdal_version__ < (3, 6, 0),
- reason="OpenFileGDB write support only available for GDAL >= 3.6.0",
-)
@pytest.mark.parametrize(
"write_int64",
[
@@ -698,12 +693,6 @@ def test_write_openfilegdb(tmp_path, write_int64):
@pytest.mark.parametrize("ext", DRIVERS)
def test_write_append(tmp_path, naturalearth_lowres, ext):
- if ext == ".fgb" and __gdal_version__ <= (3, 5, 0):
- pytest.skip("Append to FlatGeobuf fails for GDAL <= 3.5.0")
-
- if ext in (".geojsonl", ".geojsons") and __gdal_version__ < (3, 6, 0):
- pytest.skip("Append to GeoJSONSeq only available for GDAL >= 3.6.0")
-
if ext == ".gpkg.zip":
pytest.skip("Append to .gpkg.zip is not supported")
@@ -725,11 +714,8 @@ def test_write_append(tmp_path, naturalearth_lowres, ext):
assert read_info(filename)["features"] == 354
- at pytest.mark.parametrize("driver,ext", [("GML", ".gml"), ("GeoJSONSeq", ".geojsons")])
+ at pytest.mark.parametrize("driver,ext", [("GML", ".gml")])
def test_write_append_unsupported(tmp_path, naturalearth_lowres, driver, ext):
- if ext == ".geojsons" and __gdal_version__ >= (3, 6, 0):
- pytest.skip("Append to GeoJSONSeq supported for GDAL >= 3.6.0")
-
meta, _, geometry, field_data = read(naturalearth_lowres)
# GML does not support append functionality
@@ -744,27 +730,6 @@ def test_write_append_unsupported(tmp_path, naturalearth_lowres, driver, ext):
write(filename, geometry, field_data, driver=driver, append=True, **meta)
-@pytest.mark.skipif(
- __gdal_version__ > (3, 5, 0),
- reason="segfaults on FlatGeobuf limited to GDAL <= 3.5.0",
-)
-def test_write_append_prevent_gdal_segfault(tmp_path, naturalearth_lowres):
- """GDAL <= 3.5.0 segfaults when appending to FlatGeobuf; this test
- verifies that we catch that before segfault"""
- meta, _, geometry, field_data = read(naturalearth_lowres)
- meta["geometry_type"] = "MultiPolygon"
-
- filename = tmp_path / "test.fgb"
- write(filename, geometry, field_data, **meta)
-
- assert filename.exists()
-
- with pytest.raises(
- RuntimeError, # match="append to FlatGeobuf is not supported for GDAL <= 3.5.0"
- ):
- write(filename, geometry, field_data, append=True, **meta)
-
-
@pytest.mark.parametrize(
"driver",
{
@@ -794,16 +759,14 @@ def test_write_supported(tmp_path, naturalearth_lowres, driver):
assert filename.exists()
-@pytest.mark.skipif(
- __gdal_version__ >= (3, 6, 0), reason="OpenFileGDB supports write for GDAL >= 3.6.0"
-)
def test_write_unsupported(tmp_path, naturalearth_lowres):
+ """Test writing using a driver that does not support writing."""
meta, _, geometry, field_data = read(naturalearth_lowres)
- filename = tmp_path / "test.gdb"
+ filename = tmp_path / "test.topojson"
with pytest.raises(DataSourceError, match="does not support write functionality"):
- write(filename, geometry, field_data, driver="OpenFileGDB", **meta)
+ write(filename, geometry, field_data, driver="TopoJSON", **meta)
def test_write_gdalclose_error(naturalearth_lowres):
@@ -1005,15 +968,6 @@ def test_read_data_types_numeric_with_null(test_gpkg_nulls):
assert field.dtype == "float64"
-def test_read_unsupported_types(list_field_values_file):
- fields = read(list_field_values_file)[3]
- # list field gets skipped, only integer field is read
- assert len(fields) == 1
-
- fields = read(list_field_values_file, columns=["int64"])[3]
- assert len(fields) == 1
-
-
def test_read_datetime_millisecond(datetime_file):
field = read(datetime_file)[3][0]
assert field.dtype == "datetime64[ms]"
@@ -1047,15 +1001,20 @@ def test_read_unsupported_ext_with_prefix(tmp_path):
def test_read_datetime_as_string(datetime_tz_file):
field = read(datetime_tz_file)[3][0]
assert field.dtype == "datetime64[ms]"
- # timezone is ignored in numpy layer
+ # time zone is ignored in numpy layer
assert field[0] == np.datetime64("2020-01-01 09:00:00.123")
assert field[1] == np.datetime64("2020-01-01 10:00:00.000")
field = read(datetime_tz_file, datetime_as_string=True)[3][0]
assert field.dtype == "object"
- # GDAL doesn't return strings in ISO format (yet)
- assert field[0] == "2020/01/01 09:00:00.123-05"
- assert field[1] == "2020/01/01 10:00:00-05"
+
+ if __gdal_version__ < (3, 7, 0):
+ # With GDAL < 3.7, datetimes are not returned as ISO8601 strings
+ assert field[0] == "2020/01/01 09:00:00.123-05"
+ assert field[1] == "2020/01/01 10:00:00-05"
+ else:
+ assert field[0] == "2020-01-01T09:00:00.123-05:00"
+ assert field[1] == "2020-01-01T10:00:00-05:00"
@pytest.mark.parametrize("ext", ["gpkg", "geojson"])
@@ -1187,9 +1146,6 @@ def test_write_memory_driver_required(naturalearth_lowres):
@pytest.mark.parametrize("driver", ["ESRI Shapefile", "OpenFileGDB"])
def test_write_memory_unsupported_driver(naturalearth_lowres, driver):
- if driver == "OpenFileGDB" and __gdal_version__ < (3, 6, 0):
- pytest.skip("OpenFileGDB write support only available for GDAL >= 3.6.0")
-
meta, _, geometry, field_data = read(naturalearth_lowres)
buffer = BytesIO()
@@ -1491,7 +1447,6 @@ def test_write_with_mask(tmp_path):
write(filename, geometry, field_data, fields, field_mask, **meta)
-@requires_arrow_api
def test_open_arrow_capsule_protocol_without_pyarrow(naturalearth_lowres):
# this test is included here instead of test_arrow.py to ensure we also run
# it when pyarrow is not installed
@@ -1509,7 +1464,6 @@ def test_open_arrow_capsule_protocol_without_pyarrow(naturalearth_lowres):
@pytest.mark.skipif(HAS_PYARROW, reason="pyarrow is installed")
-@requires_arrow_api
def test_open_arrow_error_no_pyarrow(naturalearth_lowres):
# this test is included here instead of test_arrow.py to ensure we run
# it when pyarrow is not installed
=====================================
pyogrio/util.py
=====================================
@@ -4,13 +4,11 @@ import re
import sys
from packaging.version import Version
from pathlib import Path
-from typing import Union
from urllib.parse import urlparse
+from pyogrio._ogr import MULTI_EXTENSIONS
from pyogrio._vsi import vsimem_rmtree_toplevel as _vsimem_rmtree_toplevel
-MULTI_EXTENSIONS = (".gpkg.zip", ".shp.zip")
-
def get_vsi_path_or_buffer(path_or_buffer):
"""Get VSI-prefixed path or bytes buffer depending on type of path_or_buffer.
@@ -54,7 +52,7 @@ def get_vsi_path_or_buffer(path_or_buffer):
return vsi_path(str(path_or_buffer))
-def vsi_path(path: Union[str, Path]) -> str:
+def vsi_path(path: str | Path) -> str:
"""Ensure path is a local path or a GDAL-compatible VSI path."""
# Convert Path objects to string, but for VSI paths, keep posix style path.
if isinstance(path, Path):
@@ -237,7 +235,7 @@ def _mask_to_wkb(mask):
return shapely.to_wkb(mask)
-def vsimem_rmtree_toplevel(path: Union[str, Path]):
+def vsimem_rmtree_toplevel(path: str | Path):
"""Remove the parent directory of the file path recursively.
This is used for final cleanup of an in-memory dataset, which may have been
=====================================
pyproject.toml
=====================================
@@ -1,7 +1,7 @@
[build-system]
requires = [
"setuptools",
- "Cython>=0.29",
+ "Cython>=3.1",
"versioneer[toml]==0.28",
# tomli is used by versioneer
"tomli; python_version < '3.11'",
@@ -26,12 +26,13 @@ classifiers = [
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
"Topic :: Scientific/Engineering :: GIS",
+ "Programming Language :: Python :: Free Threading :: 2 - Beta",
]
-requires-python = ">=3.9"
+requires-python = ">=3.10"
dependencies = ["certifi", "numpy", "packaging"]
[project.optional-dependencies]
-dev = ["cython"]
+dev = ["cython>=3.1"]
test = ["pytest", "pytest-cov"]
benchmark = ["pytest-benchmark"]
geopandas = ["geopandas"]
@@ -41,17 +42,18 @@ Home = "https://pyogrio.readthedocs.io/"
Repository = "https://github.com/geopandas/pyogrio"
[tool.cibuildwheel]
-skip = ["pp*", "*musllinux*"]
+skip = ["*musllinux*"]
archs = ["auto64"]
manylinux-x86_64-image = "manylinux-x86_64-vcpkg-gdal:latest"
manylinux-aarch64-image = "manylinux-aarch64-vcpkg-gdal:latest"
build-verbosity = 3
+enable = ["cpython-freethreading"]
[tool.cibuildwheel.linux.environment]
VCPKG_INSTALL = "$VCPKG_INSTALLATION_ROOT/installed/$VCPKG_DEFAULT_TRIPLET"
GDAL_INCLUDE_PATH = "$VCPKG_INSTALL/include"
GDAL_LIBRARY_PATH = "$VCPKG_INSTALL/lib"
-GDAL_VERSION = "3.10.3"
+GDAL_VERSION = "3.11.4"
PYOGRIO_PACKAGE_DATA = 1
GDAL_DATA = "$VCPKG_INSTALL/share/gdal"
PROJ_LIB = "$VCPKG_INSTALL/share/proj"
@@ -66,7 +68,7 @@ repair-wheel-command = [
VCPKG_INSTALL = "$VCPKG_INSTALLATION_ROOT/installed/$VCPKG_DEFAULT_TRIPLET"
GDAL_INCLUDE_PATH = "$VCPKG_INSTALL/include"
GDAL_LIBRARY_PATH = "$VCPKG_INSTALL/lib"
-GDAL_VERSION = "3.10.3"
+GDAL_VERSION = "3.11.4"
PYOGRIO_PACKAGE_DATA = 1
GDAL_DATA = "$VCPKG_INSTALL/share/gdal"
PROJ_LIB = "$VCPKG_INSTALL/share/proj"
@@ -80,7 +82,7 @@ repair-wheel-command = "delvewheel repair --add-path C:/vcpkg/installed/x64-wind
VCPKG_INSTALL = "$VCPKG_INSTALLATION_ROOT/installed/x64-windows-dynamic-release"
GDAL_INCLUDE_PATH = "$VCPKG_INSTALL/include"
GDAL_LIBRARY_PATH = "$VCPKG_INSTALL/lib"
-GDAL_VERSION = "3.10.3"
+GDAL_VERSION = "3.11.4"
PYOGRIO_PACKAGE_DATA = 1
GDAL_DATA = "$VCPKG_INSTALL/share/gdal"
PROJ_LIB = "$VCPKG_INSTALL/share/proj"
=====================================
setup.py
=====================================
@@ -20,12 +20,12 @@ except ImportError:
logger = logging.getLogger(__name__)
-MIN_PYTHON_VERSION = (3, 9, 0)
+MIN_PYTHON_VERSION = (3, 10, 0)
MIN_GDAL_VERSION = (2, 4, 0)
if sys.version_info < MIN_PYTHON_VERSION:
- raise RuntimeError("Python >= 3.9 is required")
+ raise RuntimeError("Python >= 3.10 is required")
def copy_data_tree(datadir, destdir):
@@ -169,7 +169,7 @@ else:
Extension("pyogrio._ogr", ["pyogrio/_ogr.pyx"], **ext_options),
Extension("pyogrio._vsi", ["pyogrio/_vsi.pyx"], **ext_options),
],
- compiler_directives={"language_level": "3"},
+ compiler_directives={"language_level": "3", "freethreading_compatible": True},
compile_time_env=compile_time_env,
)
View it on GitLab: https://salsa.debian.org/debian-gis-team/pyogrio/-/compare/f8c2b41acb543c1a4b686e41c5d776fcb9619c5b...d862e40593134a74df9c047247b785cfb9f8fbe1