[med-svn] [Git][python-team/packages/python-pynndescent][master] 5 commits: New upstream version 0.5.5
Andreas Tille (@tille)
gitlab at salsa.debian.org
Fri Feb 18 06:25:30 GMT 2022
Andreas Tille pushed to branch master at Debian Python Team / packages / python-pynndescent
Commits:
7d135bb3 by Andreas Tille at 2022-01-02T18:59:03+01:00
New upstream version 0.5.5
- - - - -
a90e87ae by Andreas Tille at 2022-02-18T07:11:26+01:00
New upstream version 0.5.6
- - - - -
5d5b5760 by Andreas Tille at 2022-02-18T07:11:26+01:00
routine-update: New upstream version
- - - - -
06c6f5ef by Andreas Tille at 2022-02-18T07:11:28+01:00
Update upstream source from tag 'upstream/0.5.6'
Update to upstream version '0.5.6'
with Debian dir 132fc6e9551f52569d8fe1ca044ce15e83d1061d
- - - - -
f37b4e89 by Andreas Tille at 2022-02-18T07:24:17+01:00
Switch test framework from nose to pytest
- - - - -
22 changed files:
- PKG-INFO
- README.rst
- debian/changelog
- debian/control
- debian/rules
- pynndescent.egg-info/PKG-INFO
- pynndescent.egg-info/SOURCES.txt
- pynndescent/__init__.py
- pynndescent/distances.py
- pynndescent/graph_utils.py
- pynndescent/optimal_transport.py
- pynndescent/pynndescent_.py
- pynndescent/rp_trees.py
- pynndescent/sparse.py
- pynndescent/sparse_nndescent.py
- + pynndescent/tests/conftest.py
- pynndescent/tests/test_distances.py
- pynndescent/tests/test_pynndescent_.py
- pynndescent/tests/test_rank.py
- pynndescent/utils.py
- requirements.txt
- setup.py
Changes:
=====================================
PKG-INFO
=====================================
@@ -1,6 +1,6 @@
Metadata-Version: 1.2
Name: pynndescent
-Version: 0.5.2
+Version: 0.5.6
Summary: Nearest Neighbor Descent
Home-page: http://github.com/lmcinnes/pynndescent
Author: Leland McInnes
@@ -8,12 +8,14 @@ Author-email: leland.mcinnes at gmail.com
Maintainer: Leland McInnes
Maintainer-email: leland.mcinnes at gmail.com
License: BSD
-Description: .. image:: https://travis-ci.org/lmcinnes/pynndescent.svg
- :target: https://travis-ci.org/lmcinnes/pynndescent
- :alt: Travis Build Status
- .. image:: https://ci.appveyor.com/api/projects/status/github/lmcinnes/pynndescent?branch=master&svg=true
- :target: https://ci.appveyor.com/project/lmcinnes/pynndescent
- :alt: AppVeyor Build Status
+Description: .. image:: doc/pynndescent_logo.png
+ :width: 600
+ :align: center
+ :alt: PyNNDescent Logo
+
+ .. image:: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_apis/build/status/lmcinnes.pynndescent?branchName=master
+ :target: .. _build_status: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_build/latest?definitionId=2&branchName=master
+ :alt: Azure Pipelines Build Status
.. image:: https://img.shields.io/lgtm/alerts/g/lmcinnes/pynndescent.svg
:target: https://lgtm.com/projects/g/lmcinnes/pynndescent/alerts
:alt: LGTM Alerts
@@ -55,14 +57,14 @@ Description: .. image:: https://travis-ci.org/lmcinnes/pynndescent.svg
`ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
solidly in the mix of top performing ANN libraries:
- **GIST-960 Euclidean**
+ **SIFT-128 Euclidean**
- .. image:: https://camo.githubusercontent.com/142a48c992ba689b8ea9e62636b5281a97322f74/68747470733a2f2f7261772e6769746875622e636f6d2f6572696b6265726e2f616e6e2d62656e63686d61726b732f6d61737465722f726573756c74732f676973742d3936302d6575636c696465616e2e706e67
- :alt: ANN benchmark performance for GIST 960 dataset
+ .. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
+ :alt: ANN benchmark performance for SIFT 128 dataset
**NYTimes-256 Angular**
- .. image:: https://camo.githubusercontent.com/6120a35a9db64104eaa1c95cb4803c2fc4cd2679/68747470733a2f2f7261772e6769746875622e636f6d2f6572696b6265726e2f616e6e2d62656e63686d61726b732f6d61737465722f726573756c74732f6e7974696d65732d3235362d616e67756c61722e706e67
+ .. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
:alt: ANN benchmark performance for NYTimes 256 dataset
While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
=====================================
README.rst
=====================================
@@ -1,9 +1,11 @@
-.. image:: https://travis-ci.org/lmcinnes/pynndescent.svg
- :target: https://travis-ci.org/lmcinnes/pynndescent
- :alt: Travis Build Status
-.. image:: https://ci.appveyor.com/api/projects/status/github/lmcinnes/pynndescent?branch=master&svg=true
- :target: https://ci.appveyor.com/project/lmcinnes/pynndescent
- :alt: AppVeyor Build Status
+.. image:: doc/pynndescent_logo.png
+ :width: 600
+ :align: center
+ :alt: PyNNDescent Logo
+
+.. image:: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_apis/build/status/lmcinnes.pynndescent?branchName=master
+ :target: .. _build_status: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_build/latest?definitionId=2&branchName=master
+ :alt: Azure Pipelines Build Status
.. image:: https://img.shields.io/lgtm/alerts/g/lmcinnes/pynndescent.svg
:target: https://lgtm.com/projects/g/lmcinnes/pynndescent/alerts
:alt: LGTM Alerts
@@ -45,14 +47,14 @@ PyNNDescent provides fast approximate nearest neighbor queries. The
`ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
solidly in the mix of top performing ANN libraries:
-**GIST-960 Euclidean**
+**SIFT-128 Euclidean**
-.. image:: https://camo.githubusercontent.com/142a48c992ba689b8ea9e62636b5281a97322f74/68747470733a2f2f7261772e6769746875622e636f6d2f6572696b6265726e2f616e6e2d62656e63686d61726b732f6d61737465722f726573756c74732f676973742d3936302d6575636c696465616e2e706e67
- :alt: ANN benchmark performance for GIST 960 dataset
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
+ :alt: ANN benchmark performance for SIFT 128 dataset
**NYTimes-256 Angular**
-.. image:: https://camo.githubusercontent.com/6120a35a9db64104eaa1c95cb4803c2fc4cd2679/68747470733a2f2f7261772e6769746875622e636f6d2f6572696b6265726e2f616e6e2d62656e63686d61726b732f6d61737465722f726573756c74732f6e7974696d65732d3235362d616e67756c61722e706e67
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
:alt: ANN benchmark performance for NYTimes 256 dataset
While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
=====================================
debian/changelog
=====================================
@@ -1,9 +1,11 @@
-python-pynndescent (0.5.2+dfsg-2) UNRELEASED; urgency=medium
+python-pynndescent (0.5.6-1) UNRELEASED; urgency=medium
+ * New upstream version
* Move package to Debian Python Team
* Standards-Version: 4.6.0
+ * Switch test framework from nose to pytest
- -- Andreas Tille <tille at debian.org> Sun, 02 Jan 2022 18:58:17 +0100
+ -- Andreas Tille <tille at debian.org> Fri, 18 Feb 2022 07:11:26 +0100
python-pynndescent (0.5.2+dfsg-1) unstable; urgency=medium
=====================================
debian/control
=====================================
@@ -7,12 +7,12 @@ Build-Depends: debhelper-compat (= 13),
dh-python,
python3-setuptools,
python3-all,
- python3-joblib,
- python3-llvmlite,
- python3-numba,
- python3-scipy,
- python3-sklearn,
- python3-nose <!nocheck>
+ python3-joblib <!nocheck>,
+ python3-llvmlite <!nocheck>,
+ python3-numba <!nocheck>,
+ python3-scipy <!nocheck>,
+ python3-sklearn <!nocheck>,
+ python3-pytest <!nocheck>
Standards-Version: 4.6.0
Vcs-Browser: https://salsa.debian.org/python-team/packages/python-pynndescent
Vcs-Git: https://salsa.debian.org/python-team/packages/python-pynndescent.git
=====================================
debian/rules
=====================================
@@ -8,5 +8,5 @@ export PYTHONPATH=$(CURDIR)
override_dh_auto_test:
ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS)))
- dh_auto_test -- --system=custom --test-args="PYTHONPATH={build_dir} {interpreter} -m nose -v -w $(CURDIR)/pynndescent/tests"
+ dh_auto_test -- --system=custom --test-args="PYTHONPATH={build_dir} {interpreter} -m pytest -v -w $(CURDIR)/pynndescent/tests"
endif
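For reference, the same test directory can also be exercised directly through pytest's Python API; a minimal sketch (not part of the diff — pytest discovers tests from positional path arguments, and the path below is the one used in debian/rules):

    import sys
    import pytest

    # Run the package test suite verbosely against the tests directory.
    sys.exit(pytest.main(["-v", "pynndescent/tests"]))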
=====================================
pynndescent.egg-info/PKG-INFO
=====================================
@@ -1,6 +1,6 @@
Metadata-Version: 1.2
Name: pynndescent
-Version: 0.5.2
+Version: 0.5.6
Summary: Nearest Neighbor Descent
Home-page: http://github.com/lmcinnes/pynndescent
Author: Leland McInnes
@@ -8,12 +8,14 @@ Author-email: leland.mcinnes at gmail.com
Maintainer: Leland McInnes
Maintainer-email: leland.mcinnes at gmail.com
License: BSD
-Description: .. image:: https://travis-ci.org/lmcinnes/pynndescent.svg
- :target: https://travis-ci.org/lmcinnes/pynndescent
- :alt: Travis Build Status
- .. image:: https://ci.appveyor.com/api/projects/status/github/lmcinnes/pynndescent?branch=master&svg=true
- :target: https://ci.appveyor.com/project/lmcinnes/pynndescent
- :alt: AppVeyor Build Status
+Description: .. image:: doc/pynndescent_logo.png
+ :width: 600
+ :align: center
+ :alt: PyNNDescent Logo
+
+ .. image:: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_apis/build/status/lmcinnes.pynndescent?branchName=master
+ :target: .. _build_status: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_build/latest?definitionId=2&branchName=master
+ :alt: Azure Pipelines Build Status
.. image:: https://img.shields.io/lgtm/alerts/g/lmcinnes/pynndescent.svg
:target: https://lgtm.com/projects/g/lmcinnes/pynndescent/alerts
:alt: LGTM Alerts
@@ -55,14 +57,14 @@ Description: .. image:: https://travis-ci.org/lmcinnes/pynndescent.svg
`ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
solidly in the mix of top performing ANN libraries:
- **GIST-960 Euclidean**
+ **SIFT-128 Euclidean**
- .. image:: https://camo.githubusercontent.com/142a48c992ba689b8ea9e62636b5281a97322f74/68747470733a2f2f7261772e6769746875622e636f6d2f6572696b6265726e2f616e6e2d62656e63686d61726b732f6d61737465722f726573756c74732f676973742d3936302d6575636c696465616e2e706e67
- :alt: ANN benchmark performance for GIST 960 dataset
+ .. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
+ :alt: ANN benchmark performance for SIFT 128 dataset
**NYTimes-256 Angular**
- .. image:: https://camo.githubusercontent.com/6120a35a9db64104eaa1c95cb4803c2fc4cd2679/68747470733a2f2f7261772e6769746875622e636f6d2f6572696b6265726e2f616e6e2d62656e63686d61726b732f6d61737465722f726573756c74732f6e7974696d65732d3235362d616e67756c61722e706e67
+ .. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
:alt: ANN benchmark performance for NYTimes 256 dataset
While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
=====================================
pynndescent.egg-info/SOURCES.txt
=====================================
@@ -22,15 +22,8 @@ pynndescent.egg-info/not-zip-safe
pynndescent.egg-info/requires.txt
pynndescent.egg-info/top_level.txt
pynndescent/tests/__init__.py
+pynndescent/tests/conftest.py
pynndescent/tests/test_distances.py
pynndescent/tests/test_pynndescent_.py
pynndescent/tests/test_rank.py
-pynndescent/tests/__pycache__/__init__.cpython-37.pyc
-pynndescent/tests/__pycache__/__init__.cpython-38.pyc
-pynndescent/tests/__pycache__/test_distances.cpython-37.pyc
-pynndescent/tests/__pycache__/test_distances.cpython-38-pytest-6.2.2.pyc
-pynndescent/tests/__pycache__/test_pynndescent_.cpython-37.pyc
-pynndescent/tests/__pycache__/test_pynndescent_.cpython-38-pytest-6.2.2.pyc
-pynndescent/tests/__pycache__/test_rank.cpython-37.pyc
-pynndescent/tests/__pycache__/test_rank.cpython-38-pytest-6.2.2.pyc
pynndescent/tests/test_data/cosine_hang.npy
\ No newline at end of file
=====================================
pynndescent/__init__.py
=====================================
@@ -3,6 +3,13 @@ import numba
from .pynndescent_ import NNDescent, PyNNDescentTransformer
# Workaround: https://github.com/numba/numba/issues/3341
-numba.config.THREADING_LAYER = "workqueue"
+if numba.config.THREADING_LAYER == "omp":
+ try:
+ from numba.np.ufunc import tbbpool
+
+ numba.config.THREADING_LAYER = "tbb"
+ except ImportError as e:
+ # might be a missing symbol due to e.g. tbb libraries missing
+ numba.config.THREADING_LAYER = "workqueue"
__version__ = pkg_resources.get_distribution("pynndescent").version
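A minimal sketch (not part of the upstream change, synthetic data) of how the numba threading layer actually selected by the workaround above can be inspected after building an index:

    import numba
    import numpy as np
    import pynndescent

    # Building an index runs parallel jitted code, which forces layer selection.
    index = pynndescent.NNDescent(np.random.random((200, 8)).astype(np.float32), n_neighbors=5)
    # threading_layer() reports the layer numba picked ("tbb", "omp" or "workqueue").
    print(numba.threading_layer())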
=====================================
pynndescent/distances.py
=====================================
@@ -12,6 +12,7 @@ from pynndescent.optimal_transport import (
network_simplex_core,
total_cost,
ProblemStatus,
+ sinkhorn_transport_plan,
)
_mock_identity = np.eye(2, dtype=np.float32)
@@ -22,7 +23,7 @@ FLOAT32_EPS = np.finfo(np.float32).eps
FLOAT32_MAX = np.finfo(np.float32).max
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def euclidean(x, y):
r"""Standard euclidean distance.
@@ -66,7 +67,7 @@ def squared_euclidean(x, y):
return result
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def standardised_euclidean(x, y, sigma=_mock_ones):
r"""Euclidean distance standardised against a vector of standard
deviations per coordinate.
@@ -81,7 +82,7 @@ def standardised_euclidean(x, y, sigma=_mock_ones):
return np.sqrt(result)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def manhattan(x, y):
r"""Manhattan, taxicab, or l1 distance.
@@ -95,7 +96,7 @@ def manhattan(x, y):
return result
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def chebyshev(x, y):
r"""Chebyshev or l-infinity distance.
@@ -109,7 +110,7 @@ def chebyshev(x, y):
return result
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def minkowski(x, y, p=2):
r"""Minkowski distance.
@@ -128,7 +129,7 @@ def minkowski(x, y, p=2):
return result ** (1.0 / p)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def weighted_minkowski(x, y, w=_mock_ones, p=2):
r"""A weighted version of Minkowski distance.
@@ -146,7 +147,7 @@ def weighted_minkowski(x, y, w=_mock_ones, p=2):
return result ** (1.0 / p)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def mahalanobis(x, y, vinv=_mock_identity):
result = 0.0
@@ -164,7 +165,7 @@ def mahalanobis(x, y, vinv=_mock_identity):
return np.sqrt(result)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def hamming(x, y):
result = 0.0
for i in range(x.shape[0]):
@@ -174,7 +175,7 @@ def hamming(x, y):
return float(result) / x.shape[0]
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def canberra(x, y):
result = 0.0
for i in range(x.shape[0]):
@@ -185,7 +186,7 @@ def canberra(x, y):
return result
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def bray_curtis(x, y):
numerator = 0.0
denominator = 0.0
@@ -199,7 +200,7 @@ def bray_curtis(x, y):
return 0.0
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def jaccard(x, y):
num_non_zero = 0.0
num_equal = 0.0
@@ -255,7 +256,7 @@ def correct_alternative_jaccard(v):
return 1.0 - pow(2.0, -v)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def matching(x, y):
num_not_equal = 0.0
for i in range(x.shape[0]):
@@ -266,7 +267,7 @@ def matching(x, y):
return float(num_not_equal) / x.shape[0]
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def dice(x, y):
num_true_true = 0.0
num_not_equal = 0.0
@@ -282,7 +283,7 @@ def dice(x, y):
return num_not_equal / (2.0 * num_true_true + num_not_equal)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def kulsinski(x, y):
num_true_true = 0.0
num_not_equal = 0.0
@@ -300,7 +301,7 @@ def kulsinski(x, y):
)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def rogers_tanimoto(x, y):
num_not_equal = 0.0
for i in range(x.shape[0]):
@@ -311,7 +312,7 @@ def rogers_tanimoto(x, y):
return (2.0 * num_not_equal) / (x.shape[0] + num_not_equal)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def russellrao(x, y):
num_true_true = 0.0
for i in range(x.shape[0]):
@@ -325,7 +326,7 @@ def russellrao(x, y):
return float(x.shape[0] - num_true_true) / (x.shape[0])
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def sokal_michener(x, y):
num_not_equal = 0.0
for i in range(x.shape[0]):
@@ -336,7 +337,7 @@ def sokal_michener(x, y):
return (2.0 * num_not_equal) / (x.shape[0] + num_not_equal)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def sokal_sneath(x, y):
num_true_true = 0.0
num_not_equal = 0.0
@@ -352,7 +353,7 @@ def sokal_sneath(x, y):
return num_not_equal / (0.5 * num_true_true + num_not_equal)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def haversine(x, y):
if x.shape[0] != 2:
raise ValueError("haversine is only defined for 2 dimensional graph_data")
@@ -362,7 +363,7 @@ def haversine(x, y):
return 2.0 * np.arcsin(result)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def yule(x, y):
num_true_true = 0.0
num_true_false = 0.0
@@ -384,7 +385,7 @@ def yule(x, y):
)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def cosine(x, y):
result = 0.0
norm_x = 0.0
@@ -546,7 +547,7 @@ def true_angular_from_alt_cosine(d):
return 1.0 - (np.arccos(pow(2.0, -d)) / np.pi)
-@numba.njit(fastmath=True, cache=True)
+@numba.njit(fastmath=True)
def correlation(x, y):
mu_x = 0.0
mu_y = 0.0
@@ -728,9 +729,7 @@ def kantorovich(x, y, cost=_dummy_cost, max_iter=100000):
sub_cost = cost[row_mask, :][:, col_mask]
node_arc_data, spanning_tree, graph = allocate_graph_structures(
- a.shape[0],
- b.shape[0],
- False,
+ a.shape[0], b.shape[0], False
)
initialize_supply(a, -b, graph, node_arc_data.supply)
initialize_cost(sub_cost, graph, node_arc_data.cost)
@@ -740,12 +739,7 @@ def kantorovich(x, y, cost=_dummy_cost, max_iter=100000):
raise ValueError(
"Kantorovich distance inputs must be valid probability distributions."
)
- solve_status = network_simplex_core(
- node_arc_data,
- spanning_tree,
- graph,
- max_iter,
- )
+ solve_status = network_simplex_core(node_arc_data, spanning_tree, graph, max_iter)
# if solve_status == ProblemStatus.MAX_ITER_REACHED:
# print("WARNING: RESULT MIGHT BE INACCURATE\nMax number of iteration reached!")
if solve_status == ProblemStatus.INFEASIBLE:
@@ -761,6 +755,146 @@ def kantorovich(x, y, cost=_dummy_cost, max_iter=100000):
return result
+@numba.njit(fastmath=True)
+def sinkhorn(x, y, cost=_dummy_cost, regularization=1.0):
+ row_mask = x != 0
+ col_mask = y != 0
+
+ a = x[row_mask].astype(np.float64)
+ b = y[col_mask].astype(np.float64)
+
+ a_sum = a.sum()
+ b_sum = b.sum()
+
+ a /= a_sum
+ b /= b_sum
+
+ sub_cost = cost[row_mask, :][:, col_mask]
+
+ transport_plan = sinkhorn_transport_plan(
+ x, y, cost=sub_cost, regularization=regularization
+ )
+ dim_i = transport_plan.shape[0]
+ dim_j = transport_plan.shape[1]
+ result = 0.0
+ for i in range(dim_i):
+ for j in range(dim_j):
+ result += transport_plan[i, j] * cost[i, j]
+
+ return result
+
+
+@numba.njit()
+def jensen_shannon_divergence(x, y):
+ result = 0.0
+ l1_norm_x = 0.0
+ l1_norm_y = 0.0
+ dim = x.shape[0]
+
+ for i in range(dim):
+ l1_norm_x += x[i]
+ l1_norm_y += y[i]
+
+ l1_norm_x += FLOAT32_EPS * dim
+ l1_norm_y += FLOAT32_EPS * dim
+
+ pdf_x = (x + FLOAT32_EPS) / l1_norm_x
+ pdf_y = (y + FLOAT32_EPS) / l1_norm_y
+ m = 0.5 * (pdf_x + pdf_y)
+
+ for i in range(dim):
+ result += 0.5 * (
+ pdf_x[i] * np.log(pdf_x[i] / m[i]) + pdf_y[i] * np.log(pdf_y[i] / m[i])
+ )
+
+ return result
+
+
+@numba.njit()
+def wasserstein_1d(x, y, p=1):
+ x_sum = 0.0
+ y_sum = 0.0
+ for i in range(x.shape[0]):
+ x_sum += x[i]
+ y_sum += y[i]
+
+ x_cdf = x / x_sum
+ y_cdf = y / y_sum
+
+ for i in range(1, x_cdf.shape[0]):
+ x_cdf[i] += x_cdf[i - 1]
+ y_cdf[i] += y_cdf[i - 1]
+
+ return minkowski(x_cdf, y_cdf, p)
+
+
+@numba.njit()
+def circular_kantorovich(x, y, p=1):
+ x_sum = 0.0
+ y_sum = 0.0
+ for i in range(x.shape[0]):
+ x_sum += x[i]
+ y_sum += y[i]
+
+ x_cdf = x / x_sum
+ y_cdf = y / y_sum
+
+ for i in range(1, x_cdf.shape[0]):
+ x_cdf[i] += x_cdf[i - 1]
+ y_cdf[i] += y_cdf[i - 1]
+
+ mu = np.median((x_cdf - y_cdf) ** p)
+
+ # Now we just want minkowski distance on the CDFs shifted by mu
+ result = 0.0
+ if p > 2:
+ for i in range(x_cdf.shape[0]):
+ result += np.abs(x_cdf[i] - y_cdf[i] - mu) ** p
+
+ return result ** (1.0 / p)
+
+ elif p == 2:
+ for i in range(x_cdf.shape[0]):
+ val = x_cdf[i] - y_cdf[i] - mu
+ result += val * val
+
+ return np.sqrt(result)
+
+ elif p == 1:
+ for i in range(x_cdf.shape[0]):
+ result += np.abs(x_cdf[i] - y_cdf[i] - mu)
+
+ return result
+
+ else:
+ raise ValueError("Invalid p supplied to Kantorvich distance")
+
+
+@numba.njit()
+def symmetric_kl_divergence(x, y):
+ result = 0.0
+ l1_norm_x = 0.0
+ l1_norm_y = 0.0
+ dim = x.shape[0]
+
+ for i in range(dim):
+ l1_norm_x += x[i]
+ l1_norm_y += y[i]
+
+ l1_norm_x += FLOAT32_EPS * dim
+ l1_norm_y += FLOAT32_EPS * dim
+
+ pdf_x = (x + FLOAT32_EPS) / l1_norm_x
+ pdf_y = (y + FLOAT32_EPS) / l1_norm_y
+
+ for i in range(dim):
+ result += pdf_x[i] * np.log(pdf_x[i] / pdf_y[i]) + pdf_y[i] * np.log(
+ pdf_y[i] / pdf_x[i]
+ )
+
+ return result
+
+
named_distances = {
# general minkowski distances
"euclidean": euclidean,
@@ -785,14 +919,27 @@ named_distances = {
"cosine": cosine,
"dot": dot,
"correlation": correlation,
- "hellinger": hellinger,
"haversine": haversine,
"braycurtis": bray_curtis,
"spearmanr": spearmanr,
- "kantorovich": kantorovich,
- "wasserstein": kantorovich,
"tsss": tsss,
"true_angular": true_angular,
+ # Distribution distances
+ "hellinger": hellinger,
+ "kantorovich": kantorovich,
+ "wasserstein": kantorovich,
+ "wasserstein_1d": wasserstein_1d,
+ "wasserstein-1d": wasserstein_1d,
+ "kantorovich-1d": wasserstein_1d,
+ "kantorovich_1d": wasserstein_1d,
+ "circular_kantorovich": circular_kantorovich,
+ "circular_wasserstein": circular_kantorovich,
+ "sinkhorn": sinkhorn,
+ "jensen-shannon": jensen_shannon_divergence,
+ "jensen_shannon": jensen_shannon_divergence,
+ "symmetric-kl": symmetric_kl_divergence,
+ "symmetric_kl": symmetric_kl_divergence,
+ "symmetric_kullback_liebler": symmetric_kl_divergence,
# Binary distances
"hamming": hamming,
"jaccard": jaccard,
=====================================
pynndescent/graph_utils.py
=====================================
@@ -53,13 +53,7 @@ def create_component_search(index):
"seed_scale": numba.types.float32,
},
)
- def custom_search_closure(
- query_points,
- candidate_indices,
- k,
- epsilon,
- visited,
- ):
+ def custom_search_closure(query_points, candidate_indices, k, epsilon, visited):
result = make_heap(query_points.shape[0], k)
distance_scale = 1.0 + epsilon
@@ -180,8 +174,7 @@ def adjacency_matrix_representation(neighbor_indices, neighbor_distances):
neighbor_distances[neighbor_distances == 0.0] = FLOAT32_EPS
result.row = np.repeat(
- np.arange(neighbor_indices.shape[0], dtype=np.int32),
- neighbor_indices.shape[1],
+ np.arange(neighbor_indices.shape[0], dtype=np.int32), neighbor_indices.shape[1]
)
result.col = neighbor_indices.ravel()
result.data = neighbor_distances.ravel()
=====================================
pynndescent/optimal_transport.py
=====================================
@@ -31,6 +31,9 @@ import numba
from collections import namedtuple
from enum import Enum, IntEnum
+_mock_identity = np.eye(2, dtype=np.float32)
+_mock_ones = np.ones(2, dtype=np.float32)
+_dummy_cost = np.zeros((2, 2), dtype=np.float64)
# Accuracy tolerance and net supply tolerance
EPSILON = 2.2204460492503131e-15
@@ -223,14 +226,9 @@ def find_join_node(source, target, succ_num, parent, in_arc):
"second": numba.uint16,
"result": numba.uint8,
"in_arc": numba.uint32,
- },
+ }
)
-def find_leaving_arc(
- join,
- in_arc,
- node_arc_data,
- spanning_tree,
-):
+def find_leaving_arc(join, in_arc, node_arc_data, spanning_tree):
source = node_arc_data.source
target = node_arc_data.target
flow = node_arc_data.flow
@@ -299,20 +297,8 @@ def find_leaving_arc(
# Change _flow and _state vectors
# locals: val, u
# modifies: _state, _flow
-@numba.njit(
- locals={
- "u": numba.uint16,
- "in_arc": numba.uint32,
- "val": numba.float64,
- },
-)
-def update_flow(
- join,
- leaving_arc_data,
- node_arc_data,
- spanning_tree,
- in_arc,
-):
+@numba.njit(locals={"u": numba.uint16, "in_arc": numba.uint32, "val": numba.float64})
+def update_flow(join, leaving_arc_data, node_arc_data, spanning_tree, in_arc):
source = node_arc_data.source
target = node_arc_data.target
flow = node_arc_data.flow
@@ -372,15 +358,9 @@ def update_flow(
"new_stem": numba.uint16,
"par_stem": numba.uint16,
"in_arc": numba.uint32,
- },
+ }
)
-def update_spanning_tree(
- spanning_tree,
- leaving_arc_data,
- join,
- in_arc,
- source,
-):
+def update_spanning_tree(spanning_tree, leaving_arc_data, join, in_arc, source):
parent = spanning_tree.parent
thread = spanning_tree.thread
@@ -863,12 +843,7 @@ def total_cost(flow, cost):
@numba.njit(nogil=True)
-def network_simplex_core(
- node_arc_data,
- spanning_tree,
- graph,
- max_iter,
-):
+def network_simplex_core(node_arc_data, spanning_tree, graph, max_iter):
# pivot_block = PivotBlock(
# max(np.int32(np.sqrt(graph.n_arcs)), 10),
@@ -949,3 +924,271 @@ def network_simplex_core(
pi[i] -= max_pot
return solution_status
+
+
+#######################################################
+# SINKHORN distances in various variations
+#######################################################
+
+
+@numba.njit(
+ fastmath=True,
+ parallel=True,
+ locals={"diff": numba.float32, "result": numba.float32},
+ cache=True,
+)
+def right_marginal_error(u, K, v, y):
+ uK = u @ K
+ result = 0.0
+ for i in numba.prange(uK.shape[0]):
+ diff = y[i] - uK[i] * v[i]
+ result += diff * diff
+ return np.sqrt(result)
+
+
+@numba.njit(
+ fastmath=True,
+ parallel=True,
+ locals={"diff": numba.float32, "result": numba.float32},
+ cache=True,
+)
+def right_marginal_error_batch(u, K, v, y):
+ uK = K.T @ u
+ result = 0.0
+ for i in numba.prange(uK.shape[0]):
+ for j in range(uK.shape[1]):
+ diff = y[j, i] - uK[i, j] * v[i, j]
+ result += diff * diff
+ return np.sqrt(result)
+
+
+@numba.njit(fastmath=True, parallel=True, cache=True)
+def transport_plan(K, u, v):
+ i_dim = K.shape[0]
+ j_dim = K.shape[1]
+ result = np.empty_like(K)
+ for i in numba.prange(i_dim):
+ for j in range(j_dim):
+ result[i, j] = u[i] * K[i, j] * v[j]
+
+ return result
+
+
+@numba.njit(fastmath=True, parallel=True, locals={"result": numba.float32}, cache=True)
+def relative_change_in_plan(old_u, old_v, new_u, new_v):
+ i_dim = old_u.shape[0]
+ j_dim = old_v.shape[0]
+ result = 0.0
+ for i in numba.prange(i_dim):
+ for j in range(j_dim):
+ old_uv = old_u[i] * old_v[j]
+ result += np.float32(np.abs(old_uv - new_u[i] * new_v[j]) / old_uv)
+
+ return result / (i_dim * j_dim)
+
+
+@numba.njit(fastmath=True, parallel=True, cache=True)
+def precompute_K_prime(K, x):
+ i_dim = K.shape[0]
+ j_dim = K.shape[1]
+ result = np.empty_like(K)
+ for i in numba.prange(i_dim):
+ if x[i] > 0.0:
+ x_i_inverse = 1.0 / x[i]
+ else:
+ x_i_inverse = INFINITY
+ for j in range(j_dim):
+ result[i, j] = x_i_inverse * K[i, j]
+
+ return result
+
+
+@numba.njit(fastmath=True, parallel=True, cache=True)
+def K_from_cost(cost, regularization):
+ i_dim = cost.shape[0]
+ j_dim = cost.shape[1]
+ result = np.empty_like(cost)
+ for i in numba.prange(i_dim):
+ for j in range(j_dim):
+ scaled_cost = cost[i, j] / regularization
+ result[i, j] = np.exp(-scaled_cost)
+
+ return result
+
+
+@numba.njit(fastmath=True, cache=True)
+def sinkhorn_iterations(
+ x, y, u, v, K, max_iter=1000, error_tolerance=1e-9, change_tolerance=1e-9
+):
+ K_prime = precompute_K_prime(K, x)
+
+ prev_u = u
+ prev_v = v
+
+ for iteration in range(max_iter):
+
+ next_v = y / (K.T @ u)
+
+ if np.any(~np.isfinite(next_v)):
+ break
+
+ next_u = 1.0 / (K_prime @ next_v)
+
+ if np.any(~np.isfinite(next_u)):
+ break
+
+ u = next_u
+ v = next_v
+
+ if iteration % 20 == 0:
+ # Check if values in plan have changed significantly since last 20 iterations
+ relative_change = relative_change_in_plan(prev_u, prev_v, next_u, next_v)
+ if relative_change <= change_tolerance:
+ break
+
+ prev_u = u
+ prev_v = v
+
+ if iteration % 10 == 0:
+ # Check if right marginal error is less than tolerance every 10 iterations
+ err = right_marginal_error(u, K, v, y)
+ if err <= error_tolerance:
+ break
+
+ return u, v
+
+
+@numba.njit(fastmath=True, cache=True)
+def sinkhorn_iterations_batch(x, y, u, v, K, max_iter=1000, error_tolerance=1e-9):
+ K_prime = precompute_K_prime(K, x)
+
+ for iteration in range(max_iter):
+
+ next_v = y.T / (K.T @ u)
+
+ if np.any(~np.isfinite(next_v)):
+ break
+
+ next_u = 1.0 / (K_prime @ next_v)
+
+ if np.any(~np.isfinite(next_u)):
+ break
+
+ u = next_u
+ v = next_v
+
+ if iteration % 10 == 0:
+ # Check if right marginal error is less than tolerance every 10 iterations
+ err = right_marginal_error_batch(u, K, v, y)
+ if err <= error_tolerance:
+ break
+
+ return u, v
+
+
+@numba.njit(fastmath=True, cache=True)
+def sinkhorn_transport_plan(
+ x,
+ y,
+ cost=_dummy_cost,
+ regularization=1.0,
+ max_iter=1000,
+ error_tolerance=1e-9,
+ change_tolerance=1e-9,
+):
+ dim_x = x.shape[0]
+ dim_y = y.shape[0]
+ u = np.full(dim_x, 1.0 / dim_x, dtype=cost.dtype)
+ v = np.full(dim_y, 1.0 / dim_y, dtype=cost.dtype)
+
+ K = K_from_cost(cost, regularization)
+ u, v = sinkhorn_iterations(
+ x,
+ y,
+ u,
+ v,
+ K,
+ max_iter=max_iter,
+ error_tolerance=error_tolerance,
+ change_tolerance=change_tolerance,
+ )
+
+ return transport_plan(K, u, v)
+
+
+@numba.njit(fastmath=True, cache=True)
+def sinkhorn_distance(x, y, cost=_dummy_cost, regularization=1.0):
+ transport_plan = sinkhorn_transport_plan(
+ x, y, cost=cost, regularization=regularization
+ )
+ dim_i = transport_plan.shape[0]
+ dim_j = transport_plan.shape[1]
+ result = 0.0
+ for i in range(dim_i):
+ for j in range(dim_j):
+ result += transport_plan[i, j] * cost[i, j]
+
+ return result
+
+
+@numba.njit(fastmath=True, parallel=True, cache=True)
+def sinkhorn_distance_batch(x, y, cost=_dummy_cost, regularization=1.0):
+ dim_x = x.shape[0]
+ dim_y = y.shape[0]
+
+ batch_size = y.shape[1]
+
+ u = np.full((dim_x, batch_size), 1.0 / dim_x, dtype=cost.dtype)
+ v = np.full((dim_y, batch_size), 1.0 / dim_y, dtype=cost.dtype)
+
+ K = K_from_cost(cost, regularization)
+ u, v = sinkhorn_iterations_batch(
+ x,
+ y,
+ u,
+ v,
+ K,
+ )
+
+ i_dim = K.shape[0]
+ j_dim = K.shape[1]
+ result = np.zeros(batch_size)
+ for i in range(i_dim):
+ for j in range(j_dim):
+ K_times_cost = K[i, j] * cost[i, j]
+ for batch in range(batch_size):
+ result[batch] += u[i, batch] * K_times_cost * v[j, batch]
+
+ return result
+
+
+def make_fixed_cost_sinkhorn_distance(cost, regularization=1.0):
+
+ K = K_from_cost(cost, regularization)
+ dim_x = K.shape[0]
+ dim_y = K.shape[1]
+
+ @numba.njit(fastmath=True)
+ def closure(x, y):
+ u = np.full(dim_x, 1.0 / dim_x, dtype=cost.dtype)
+ v = np.full(dim_y, 1.0 / dim_y, dtype=cost.dtype)
+
+ K = K_from_cost(cost, regularization)
+ u, v = sinkhorn_iterations(
+ x,
+ y,
+ u,
+ v,
+ K,
+ )
+
+ current_plan = transport_plan(K, u, v)
+
+ result = 0.0
+ for i in range(dim_x):
+ for j in range(dim_y):
+ result += current_plan[i, j] * cost[i, j]
+
+ return result
+
+ return closure
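A rough calling sketch for the new Sinkhorn helpers (made-up inputs, not part of the diff); sinkhorn_distance expects non-negative weight vectors and a dense cost matrix of shape (len(x), len(y)):

    import numpy as np
    from pynndescent.optimal_transport import sinkhorn_distance

    x = np.array([0.2, 0.3, 0.5], dtype=np.float64)
    y = np.array([0.5, 0.4, 0.1], dtype=np.float64)
    # Cost of moving mass from bin i to bin j; here simply |i - j|.
    cost = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(np.float64)

    d = sinkhorn_distance(x, y, cost=cost, regularization=1.0)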
=====================================
pynndescent/pynndescent_.py
=====================================
@@ -9,12 +9,7 @@ import numpy as np
from sklearn.utils import check_random_state, check_array
from sklearn.preprocessing import normalize
from sklearn.base import BaseEstimator, TransformerMixin
-from scipy.sparse import (
- csr_matrix,
- coo_matrix,
- isspmatrix_csr,
- vstack as sparse_vstack,
-)
+from scipy.sparse import csr_matrix, coo_matrix, isspmatrix_csr, vstack as sparse_vstack, issparse
import heapq
@@ -30,7 +25,6 @@ from pynndescent.utils import (
new_build_candidates,
ts,
simple_heap_push,
- flagged_heap_push,
checked_flagged_heap_push,
has_been_visited,
mark_visited,
@@ -64,7 +58,12 @@ FLOAT32_EPS = np.finfo(np.float32).eps
EMPTY_GRAPH = make_heap(1, 1)
-@numba.njit(parallel=True)
+def is_c_contiguous(array_like):
+ flags = getattr(array_like, "flags", None)
+ return flags is not None and flags["C_CONTIGUOUS"]
+
+
+@numba.njit(parallel=True, cache=True)
def generate_leaf_updates(leaf_block, dist_thresholds, data, dist):
updates = [[(-1, -1, np.inf)] for i in range(leaf_block.shape[0])]
@@ -87,13 +86,7 @@ def generate_leaf_updates(leaf_block, dist_thresholds, data, dist):
return updates
-@numba.njit(
- locals={
- "d": numba.float32,
- "p": numba.int32,
- "q": numba.int32,
- }
-)
+@numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=True)
def init_rp_tree(data, dist, current_graph, leaf_array):
n_leaves = leaf_array.shape[0]
@@ -107,12 +100,7 @@ def init_rp_tree(data, dist, current_graph, leaf_array):
leaf_block = leaf_array[block_start:block_end]
dist_thresholds = current_graph[1][:, 0]
- updates = generate_leaf_updates(
- leaf_block,
- dist_thresholds,
- data,
- dist,
- )
+ updates = generate_leaf_updates(leaf_block, dist_thresholds, data, dist)
for j in range(len(updates)):
for k in range(len(updates[j])):
@@ -121,8 +109,6 @@ def init_rp_tree(data, dist, current_graph, leaf_array):
if p == -1 or q == -1:
continue
- # heap_push(current_graph, p, d, q, 1)
- # heap_push(current_graph, q, d, p, 1)
checked_flagged_heap_push(
current_graph[1][p],
current_graph[0][p],
@@ -142,7 +128,9 @@ def init_rp_tree(data, dist, current_graph, leaf_array):
@numba.njit(
- fastmath=True, locals={"d": numba.float32, "idx": numba.int32, "i": numba.int32}
+ fastmath=True,
+ locals={"d": numba.float32, "idx": numba.int32, "i": numba.int32},
+ cache=True,
)
def init_random(n_neighbors, data, heap, dist, rng_state):
for i in range(data.shape[0]):
@@ -150,7 +138,6 @@ def init_random(n_neighbors, data, heap, dist, rng_state):
for j in range(n_neighbors - np.sum(heap[0][i] >= 0.0)):
idx = np.abs(tau_rand_int(rng_state)) % data.shape[0]
d = dist(data[idx], data[i])
- # heap_push(heap, i, d, idx, 1)
checked_flagged_heap_push(
heap[1][i], heap[0][i], heap[2][i], d, idx, np.uint8(1)
)
@@ -158,25 +145,20 @@ def init_random(n_neighbors, data, heap, dist, rng_state):
return
-@numba.njit()
+@numba.njit(cache=True)
def init_from_neighbor_graph(heap, indices, distances):
for p in range(indices.shape[0]):
for k in range(indices.shape[1]):
q = indices[p, k]
d = distances[p, k]
- # unchecked_heap_push(heap, p, d, q, 0)
- flagged_heap_push(heap[0][p], heap[1][p], heap[2][p], q, d, 0)
+ checked_flagged_heap_push(heap[1][p], heap[0][p], heap[2][p], d, q, 0)
return
-@numba.njit(parallel=True)
+@numba.njit(parallel=True, cache=True)
def generate_graph_updates(
- new_candidate_block,
- old_candidate_block,
- dist_thresholds,
- data,
- dist,
+ new_candidate_block, old_candidate_block, dist_thresholds, data, dist
):
block_size = new_candidate_block.shape[0]
@@ -210,7 +192,7 @@ def generate_graph_updates(
return updates
-@numba.njit()
+@numba.njit(cache=True)
def process_candidates(
data,
dist,
@@ -219,6 +201,7 @@ def process_candidates(
old_candidate_neighbors,
n_blocks,
block_size,
+ n_threads,
):
c = 0
n_vertices = new_candidate_neighbors.shape[0]
@@ -232,14 +215,10 @@ def process_candidates(
dist_thresholds = current_graph[1][:, 0]
updates = generate_graph_updates(
- new_candidate_block,
- old_candidate_block,
- dist_thresholds,
- data,
- dist,
+ new_candidate_block, old_candidate_block, dist_thresholds, data, dist
)
- c += apply_graph_updates_low_memory(current_graph, updates)
+ c += apply_graph_updates_low_memory(current_graph, updates, n_threads)
return c
@@ -259,15 +238,14 @@ def nn_descent_internal_low_memory_parallel(
n_vertices = data.shape[0]
block_size = 16384
n_blocks = n_vertices // block_size
+ n_threads = numba.get_num_threads()
for n in range(n_iters):
if verbose:
print("\t", n + 1, " / ", n_iters)
(new_candidate_neighbors, old_candidate_neighbors) = new_build_candidates(
- current_graph,
- max_candidates,
- rng_state,
+ current_graph, max_candidates, rng_state, n_threads
)
c = process_candidates(
@@ -278,6 +256,7 @@ def nn_descent_internal_low_memory_parallel(
old_candidate_neighbors,
n_blocks,
block_size,
+ n_threads,
)
if c <= delta * n_neighbors * data.shape[0]:
@@ -301,6 +280,7 @@ def nn_descent_internal_high_memory_parallel(
n_vertices = data.shape[0]
block_size = 16384
n_blocks = n_vertices // block_size
+ n_threads = numba.get_num_threads()
in_graph = [
set(current_graph[0][i].astype(np.int64))
@@ -312,9 +292,7 @@ def nn_descent_internal_high_memory_parallel(
print("\t", n + 1, " / ", n_iters)
(new_candidate_neighbors, old_candidate_neighbors) = new_build_candidates(
- current_graph,
- max_candidates,
- rng_state,
+ current_graph, max_candidates, rng_state, n_threads
)
c = 0
@@ -327,11 +305,7 @@ def nn_descent_internal_high_memory_parallel(
dist_thresholds = current_graph[1][:, 0]
updates = generate_graph_updates(
- new_candidate_block,
- old_candidate_block,
- dist_thresholds,
- data,
- dist,
+ new_candidate_block, old_candidate_block, dist_thresholds, data, dist
)
c += apply_graph_updates_high_memory(current_graph, updates, in_graph)
@@ -465,8 +439,7 @@ def diversify_csr(
if retained[l] == 1:
d = dist(
- source_data[current_indices[j]],
- source_data[current_indices[k]],
+ source_data[current_indices[j]], source_data[current_indices[k]]
)
if current_data[l] > FLOAT32_EPS and d < current_data[j]:
if tau_rand(rng_state) < prune_probability:
@@ -528,7 +501,7 @@ def resort_tree_indices(tree, tree_order):
return new_tree
-class NNDescent(object):
+class NNDescent:
"""NNDescent for fast approximate nearest neighbor queries. NNDescent is
very flexible and supports a wide variety of distances, including
non-metric distances. NNDescent also scales well against high dimensional
@@ -537,7 +510,7 @@ class NNDescent(object):
Parameters
----------
- data: array os shape (n_samples, n_features)
+ data: array of shape (n_samples, n_features)
The training graph_data set to find nearest neighbors in.
metric: string or callable (optional, default='euclidean')
@@ -566,6 +539,7 @@ class NNDescent(object):
* sokalsneath
* yule
* hellinger
+ * wasserstein-1d
Metrics that take arguments (such as minkowski, mahalanobis etc.)
can have arguments passed via the metric_kwds dictionary. At this
time care must be taken and dictionary elements must be ordered
@@ -605,6 +579,12 @@ class NNDescent(object):
that no edges get removed, and larger values result in significantly more
aggressive edge removal. A value of 1.0 will prune all edges that it can.
+ n_search_trees: int (optional, default=1)
+ The number of random projection trees to use in initializing searching or
+ querying.
+
+ .. deprecated:: 0.5.5
+
tree_init: bool (optional, default=True)
Whether to use random projection trees for initialization.
@@ -621,14 +601,11 @@ class NNDescent(object):
The``'alternative'`` algorithm has been deprecated and is no longer
available.
- low_memory: boolean (optional, default=False)
+ low_memory: boolean (optional, default=True)
Whether to use a lower memory, but more computationally expensive
- approach to index construction. This defaults to false as for most
- cases it speeds index construction, but if you are having issues
- with excessive memory use for your dataset consider setting this
- to True.
+ approach to index construction.
- max_candidates: int (optional, default=20)
+ max_candidates: int (optional, default=None)
Internally each "self-join" keeps a maximum number of candidates (
nearest neighbors and reverse nearest neighbors) to be considered.
This value controls this aspect of the algorithm. Larger values will
@@ -659,6 +636,11 @@ class NNDescent(object):
result in a significantly smaller index, particularly useful for saving,
but will remove information that might otherwise be useful.
+ parallel_batch_queries: bool (optional, default=False)
+ Whether to use parallelism of batched queries. This can be useful for large
+ batches of queries on multicore machines, but results in performance degradation
+ for single queries, so is poor for streaming use.
+
verbose: bool (optional, default=False)
Whether to print status graph_data during the computation.
"""
@@ -682,7 +664,8 @@ class NNDescent(object):
n_iters=None,
delta=0.001,
n_jobs=None,
- compressed=True,
+ compressed=False,
+ parallel_batch_queries=False,
verbose=False,
):
@@ -707,8 +690,14 @@ class NNDescent(object):
self.dim = data.shape[1]
self.n_jobs = n_jobs
self.compressed = compressed
+ self.parallel_batch_queries = parallel_batch_queries
self.verbose = verbose
+ if getattr(data, "dtype", None) == np.float32 and (issparse(data) or is_c_contiguous(data)):
+ copy_on_normalize = True
+ else:
+ copy_on_normalize = False
+
data = check_array(data, dtype=np.float32, accept_sparse="csr", order="C")
self._raw_data = data
@@ -765,7 +754,7 @@ class NNDescent(object):
self._angular_trees = False
if metric == "dot":
- data = normalize(data, norm="l2", copy=False)
+ data = normalize(data, norm="l2", copy=copy_on_normalize)
self.rng_state = current_random_state.randint(INT32_MIN, INT32_MAX, 3).astype(
np.int64
@@ -840,13 +829,7 @@ class NNDescent(object):
@numba.njit()
def _partial_dist_func(ind1, data1, ind2, data2):
- return _distance_func(
- ind1,
- data1,
- ind2,
- data2,
- *dist_args,
- )
+ return _distance_func(ind1, data1, ind2, data2, *dist_args)
self._distance_func = _partial_dist_func
else:
@@ -932,7 +915,10 @@ class NNDescent(object):
if not hasattr(self, "_search_graph"):
self._init_search_graph()
if not hasattr(self, "_search_function"):
- self._init_search_function()
+ if self._is_sparse:
+ self._init_sparse_search_function()
+ else:
+ self._init_search_function()
result = self.__dict__.copy()
if hasattr(self, "_rp_forest"):
del result["_rp_forest"]
@@ -946,7 +932,10 @@ class NNDescent(object):
self._search_forest = tuple(
[renumbaify_tree(tree) for tree in d["_search_forest"]]
)
- self._init_search_function()
+ if self._is_sparse:
+ self._init_sparse_search_function()
+ else:
+ self._init_search_function()
def _init_search_graph(self):
@@ -1118,8 +1107,7 @@ class NNDescent(object):
print(
ts(),
"Degree pruning reduced edges from {} to {}".format(
- pre_prune_nnz,
- self._search_graph.nnz,
+ pre_prune_nnz, self._search_graph.nnz
),
)
@@ -1171,7 +1159,7 @@ class NNDescent(object):
numba.types.Array(numba.types.int32, 1, "C", readonly=True)(
numba.types.Array(numba.types.float32, 1, "C", readonly=True),
numba.types.Array(numba.types.int64, 1, "C", readonly=False),
- ),
+ )
],
locals={"node": numba.types.uint32, "side": numba.types.boolean},
)
@@ -1198,6 +1186,7 @@ class NNDescent(object):
indices = self._search_graph.indices
dist = self._distance_func
n_neighbors = self.n_neighbors
+ parallel_search = self.parallel_batch_queries
@numba.njit(
fastmath=True,
@@ -1220,21 +1209,22 @@ class NNDescent(object):
"distance_bound": numba.types.float32,
"seed_scale": numba.types.float32,
},
+ parallel=self.parallel_batch_queries,
)
- def search_closure(
- query_points,
- k,
- epsilon,
- visited,
- rng_state,
- ):
+ def search_closure(query_points, k, epsilon, visited, rng_state):
result = make_heap(query_points.shape[0], k)
distance_scale = 1.0 + epsilon
internal_rng_state = np.copy(rng_state)
- for i in range(query_points.shape[0]):
- visited[:] = 0
+ for i in numba.prange(query_points.shape[0]):
+ # Avoid races on visited if parallel
+ if parallel_search:
+ visited_nodes = np.zeros_like(visited)
+ else:
+ visited_nodes = visited
+ visited_nodes[:] = 0
+
if dist == alternative_dot or dist == alternative_cosine:
norm = np.sqrt((query_points[i] ** 2).sum())
if norm > 0.0:
@@ -1258,24 +1248,24 @@ class NNDescent(object):
for j in range(n_initial_points):
candidate = candidate_indices[j]
- d = dist(data[candidate], current_query)
+ d = np.float32(dist(data[candidate], current_query))
# indices are guaranteed different
simple_heap_push(heap_priorities, heap_indices, d, candidate)
heapq.heappush(seed_set, (d, candidate))
- mark_visited(visited, candidate)
+ mark_visited(visited_nodes, candidate)
if n_random_samples > 0:
for j in range(n_random_samples):
candidate = np.int32(
np.abs(tau_rand_int(internal_rng_state)) % data.shape[0]
)
- if has_been_visited(visited, candidate) == 0:
- d = dist(data[candidate], current_query)
+ if has_been_visited(visited_nodes, candidate) == 0:
+ d = np.float32(dist(data[candidate], current_query))
simple_heap_push(
heap_priorities, heap_indices, d, candidate
)
heapq.heappush(seed_set, (d, candidate))
- mark_visited(visited, candidate)
+ mark_visited(visited_nodes, candidate)
############ Search ##############
distance_bound = distance_scale * heap_priorities[0]
@@ -1289,10 +1279,10 @@ class NNDescent(object):
candidate = indices[j]
- if has_been_visited(visited, candidate) == 0:
- mark_visited(visited, candidate)
+ if has_been_visited(visited_nodes, candidate) == 0:
+ mark_visited(visited_nodes, candidate)
- d = dist(data[candidate], current_query)
+ d = np.float32(dist(data[candidate], current_query))
if d < distance_bound:
simple_heap_push(
@@ -1314,11 +1304,7 @@ class NNDescent(object):
# Force compilation of the search function (hardcoded k, epsilon)
query_data = self._raw_data[:1]
_ = self._search_function(
- query_data,
- 5,
- 0.0,
- self._visited,
- self.search_rng_state,
+ query_data, 5, 0.0, self._visited, self.search_rng_state
)
def _init_sparse_search_function(self):
@@ -1337,7 +1323,7 @@ class NNDescent(object):
numba.types.Array(numba.types.int32, 1, "C", readonly=True),
numba.types.Array(numba.types.float32, 1, "C", readonly=True),
numba.types.Array(numba.types.int64, 1, "C", readonly=False),
- ),
+ )
],
locals={"node": numba.types.uint32, "side": numba.types.boolean},
)
@@ -1369,6 +1355,7 @@ class NNDescent(object):
indices = self._search_graph.indices
dist = self._distance_func
n_neighbors = self.n_neighbors
+ parallel_search = self.parallel_batch_queries
@numba.njit(
fastmath=True,
@@ -1387,15 +1374,10 @@ class NNDescent(object):
"distance_scale": numba.types.float32,
"seed_scale": numba.types.float32,
},
+ parallel=self.parallel_batch_queries,
)
def search_closure(
- query_inds,
- query_indptr,
- query_data,
- k,
- epsilon,
- visited,
- rng_state,
+ query_inds, query_indptr, query_data, k, epsilon, visited, rng_state
):
n_query_points = query_indptr.shape[0] - 1
@@ -1404,8 +1386,13 @@ class NNDescent(object):
distance_scale = 1.0 + epsilon
internal_rng_state = np.copy(rng_state)
- for i in range(n_query_points):
- visited[:] = 0
+ for i in numba.prange(n_query_points):
+ # Avoid races on visited if parallel
+ if parallel_search:
+ visited_nodes = np.zeros_like(visited)
+ else:
+ visited_nodes = visited
+ visited_nodes[:] = 0
current_query_inds = query_inds[query_indptr[i] : query_indptr[i + 1]]
current_query_data = query_data[query_indptr[i] : query_indptr[i + 1]]
@@ -1424,9 +1411,7 @@ class NNDescent(object):
############ Init ################
index_bounds = sparse_tree_search_closure(
- current_query_inds,
- current_query_data,
- internal_rng_state,
+ current_query_inds, current_query_data, internal_rng_state
)
candidate_indices = tree_indices[index_bounds[0] : index_bounds[1]]
@@ -1443,20 +1428,20 @@ class NNDescent(object):
data_indptr[candidate] : data_indptr[candidate + 1]
]
- d = dist(
+ d = np.float32(dist(
from_inds, from_data, current_query_inds, current_query_data
- )
+ ))
# indices are guaranteed different
simple_heap_push(heap_priorities, heap_indices, d, candidate)
heapq.heappush(seed_set, (d, candidate))
- mark_visited(visited, candidate)
+ mark_visited(visited_nodes, candidate)
if n_random_samples > 0:
for j in range(n_random_samples):
candidate = np.int32(
np.abs(tau_rand_int(internal_rng_state)) % n_index_points
)
- if has_been_visited(visited, candidate) == 0:
+ if has_been_visited(visited_nodes, candidate) == 0:
from_inds = data_inds[
data_indptr[candidate] : data_indptr[candidate + 1]
]
@@ -1464,18 +1449,18 @@ class NNDescent(object):
data_indptr[candidate] : data_indptr[candidate + 1]
]
- d = dist(
+ d = np.float32(dist(
from_inds,
from_data,
current_query_inds,
current_query_data,
- )
+ ))
simple_heap_push(
heap_priorities, heap_indices, d, candidate
)
heapq.heappush(seed_set, (d, candidate))
- mark_visited(visited, candidate)
+ mark_visited(visited_nodes, candidate)
############ Search ##############
distance_bound = distance_scale * heap_priorities[0]
@@ -1489,8 +1474,8 @@ class NNDescent(object):
candidate = indices[j]
- if has_been_visited(visited, candidate) == 0:
- mark_visited(visited, candidate)
+ if has_been_visited(visited_nodes, candidate) == 0:
+ mark_visited(visited_nodes, candidate)
from_inds = data_inds[
data_indptr[candidate] : data_indptr[candidate + 1]
@@ -1499,12 +1484,12 @@ class NNDescent(object):
data_indptr[candidate] : data_indptr[candidate + 1]
]
- d = dist(
+ d = np.float32(dist(
from_inds,
from_data,
current_query_inds,
current_query_data,
- )
+ ))
if d < distance_bound:
simple_heap_push(
@@ -1616,11 +1601,7 @@ class NNDescent(object):
query_data = np.asarray(query_data).astype(np.float32, order="C")
result = self._search_function(
- query_data,
- k,
- epsilon,
- self._visited,
- self.search_rng_state,
+ query_data, k, epsilon, self._visited, self.search_rng_state
)
else:
# Sparse case
@@ -1653,6 +1634,9 @@ class NNDescent(object):
return indices, dists
def update(self, X):
+ if not hasattr(self, "_search_graph"):
+ self._init_search_graph()
+
current_random_state = check_random_state(self.random_state)
rng_state = current_random_state.randint(INT32_MIN, INT32_MAX, 3).astype(
np.int64
@@ -1754,6 +1738,8 @@ class PyNNDescentTransformer(BaseEstimator, TransformerMixin):
* sokalmichener
* sokalsneath
* yule
+ * hellinger
+ * wasserstein-1d
Metrics that take arguments (such as minkowski, mahalanobis etc.)
can have arguments passed via the metric_kwds dictionary. At this
time care must be taken and dictionary elements must be ordered
@@ -1787,10 +1773,12 @@ class PyNNDescentTransformer(BaseEstimator, TransformerMixin):
that no edges get removed, and larger values result in significantly more
aggressive edge removal. A value of 1.0 will prune all edges that it can.
- n_search_trees: float (optional, default=1)
+ n_search_trees: int (optional, default=1)
The number of random projection trees to use in initializing searching or
querying.
+ .. deprecated:: 0.5.5
+
search_epsilon: float (optional, default=0.1)
When searching for nearest neighbors of a query point this values
controls the trade-off between accuracy and search cost. Larger values
@@ -1841,6 +1829,11 @@ class PyNNDescentTransformer(BaseEstimator, TransformerMixin):
and less accurate searching. Don't tweak this value unless you know
what you're doing.
+ parallel_batch_queries: bool (optional, default=False)
+ Whether to use parallelism of batched queries. This can be useful for large
+ batches of queries on multicore machines, but results in performance degradation
+ for single queries, so is poor for streaming use.
+
verbose: bool (optional, default=False)
Whether to print status graph_data during the computation.
@@ -1872,6 +1865,7 @@ class PyNNDescentTransformer(BaseEstimator, TransformerMixin):
max_candidates=None,
n_iters=None,
early_termination_value=0.001,
+ parallel_batch_queries=False,
verbose=False,
):
@@ -1891,6 +1885,7 @@ class PyNNDescentTransformer(BaseEstimator, TransformerMixin):
self.n_iters = n_iters
self.early_termination_value = early_termination_value
self.n_jobs = n_jobs
+ self.parallel_batch_queries = parallel_batch_queries
self.verbose = verbose
def fit(self, X, compress_index=True):
@@ -1939,6 +1934,7 @@ class PyNNDescentTransformer(BaseEstimator, TransformerMixin):
delta=self.early_termination_value,
n_jobs=self.n_jobs,
compressed=compress_index,
+ parallel_batch_queries=self.parallel_batch_queries,
verbose=self.verbose,
)
@@ -1975,8 +1971,7 @@ class PyNNDescentTransformer(BaseEstimator, TransformerMixin):
print(ts(), "Constructing neighbor matrix")
result = coo_matrix((n_samples_transform, self.n_samples_fit), dtype=np.float32)
result.row = np.repeat(
- np.arange(indices.shape[0], dtype=np.int32),
- indices.shape[1],
+ np.arange(indices.shape[0], dtype=np.int32), indices.shape[1]
)
result.col = indices.ravel()
result.data = distances.ravel()
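An illustrative sketch of the new parallel_batch_queries option on synthetic data (not part of the diff):

    import numpy as np
    from pynndescent import NNDescent

    data = np.random.random((1000, 16)).astype(np.float32)
    queries = np.random.random((200, 16)).astype(np.float32)

    # Batched queries are parallelised; single streaming queries are better served
    # by the default parallel_batch_queries=False.
    index = NNDescent(data, n_neighbors=10, parallel_batch_queries=True)
    index.prepare()  # build the search graph and compile the search function up front
    indices, distances = index.query(queries, k=10)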
=====================================
pynndescent/rp_trees.py
=====================================
@@ -6,11 +6,9 @@ from warnings import warn
import locale
import numpy as np
import numba
-from numba.core import types
-from numba.experimental import structref
import scipy.sparse
-from pynndescent.sparse import sparse_mul, sparse_diff, sparse_sum, arr_intersect
+from pynndescent.sparse import sparse_mul, sparse_diff, sparse_sum, arr_intersect, sparse_dot_product
from pynndescent.utils import tau_rand_int, norm
import joblib
@@ -548,13 +546,11 @@ def make_euclidean_tree(
offsets.append(offset)
children.append((np.int32(left_node_num), np.int32(right_node_num)))
point_indices.append(np.array([-1], dtype=np.int32))
- # print("Made a node in tree with", len(point_indices), "nodes")
else:
hyperplanes.append(np.array([-1.0], dtype=np.float32))
offsets.append(-np.inf)
children.append((np.int32(-1), np.int32(-1)))
point_indices.append(indices)
- # print("Made a leaf in tree with", len(point_indices), "nodes")
return
@@ -795,9 +791,7 @@ def make_dense_tree(data, rng_state, leaf_size=30, angular=False):
leaf_size,
)
- # print("Completed a tree")
result = FlatTree(hyperplanes, offsets, children, point_indices, leaf_size)
- # print("Tree type is:", numba.typeof(result))
return result
@@ -856,6 +850,7 @@ def make_sparse_tree(inds, indptr, spdata, rng_state, leaf_size=30, angular=Fals
"dim": numba.types.intp,
"d": numba.types.uint16,
},
+ cache=True,
)
def select_side(hyperplane, offset, point, rng_state):
margin = offset
@@ -888,6 +883,7 @@ def select_side(hyperplane, offset, point, rng_state):
),
],
locals={"node": numba.types.uint32, "side": numba.types.boolean},
+ cache=True,
)
def search_flat_tree(point, hyperplanes, offsets, children, indices, rng_state):
node = 0
@@ -901,7 +897,7 @@ def search_flat_tree(point, hyperplanes, offsets, children, indices, rng_state):
return indices[-children[node, 0] : -children[node, 1]]
-@numba.njit(fastmath=True)
+@numba.njit(fastmath=True, cache=True)
def sparse_select_side(hyperplane, offset, point_inds, point_data, rng_state):
margin = offset
@@ -912,10 +908,7 @@ def sparse_select_side(hyperplane, offset, point_inds, point_data, rng_state):
hyperplane_inds = hyperplane[0, :hyperplane_size].astype(np.int32)
hyperplane_data = hyperplane[1, :hyperplane_size]
- _, aux_data = sparse_mul(hyperplane_inds, hyperplane_data, point_inds, point_data)
-
- for val in aux_data:
- margin += val
+ margin += sparse_dot_product(hyperplane_inds, hyperplane_data, point_inds, point_data)
if abs(margin) < EPS:
side = tau_rand_int(rng_state) % 2
@@ -929,7 +922,7 @@ def sparse_select_side(hyperplane, offset, point_inds, point_data, rng_state):
return 1
-@numba.njit(locals={"node": numba.types.uint32})
+@numba.njit(locals={"node": numba.types.uint32}, cache=True)
def search_sparse_flat_tree(
point_inds, point_data, hyperplanes, offsets, children, indices, rng_state
):
@@ -975,7 +968,7 @@ def make_forest(
# print(ts(), "Started forest construction")
result = []
if leaf_size is None:
- leaf_size = max(10, n_neighbors)
+ leaf_size = max(10, np.int32(n_neighbors))
if n_jobs is None:
n_jobs = -1
@@ -1010,7 +1003,7 @@ def make_forest(
return tuple(result)
-@numba.njit(nogil=True)
+@numba.njit(nogil=True, cache=True)
def get_leaves_from_tree(tree):
n_leaves = 0
for i in range(len(tree.children)):
@@ -1124,7 +1117,7 @@ def recursive_convert_sparse(
return node_num, leaf_start
-@numba.njit()
+@numba.njit(cache=True)
def num_nodes_and_leaves(tree):
n_nodes = 0
n_leaves = 0
@@ -1138,7 +1131,7 @@ def num_nodes_and_leaves(tree):
return n_nodes, n_leaves
-@numba.njit()
+@numba.njit(cache=True)
def dense_hyperplane_dim(hyperplanes):
for i in range(len(hyperplanes)):
if hyperplanes[i].shape[0] > 1:
@@ -1147,7 +1140,7 @@ def dense_hyperplane_dim(hyperplanes):
raise ValueError("No hyperplanes of adequate size were found!")
-@numba.njit()
+@numba.njit(cache=True)
def sparse_hyperplane_dim(hyperplanes):
max_dim = 0
for i in range(len(hyperplanes)):
@@ -1226,6 +1219,7 @@ def renumbaify_tree(tree):
"result": numba.float32,
"i": numba.uint32,
},
+ cache=True,
)
def score_tree(tree, neighbor_indices, data, rng_state):
result = 0.0
@@ -1243,7 +1237,7 @@ def score_tree(tree, neighbor_indices, data, rng_state):
return result / numba.float32(neighbor_indices.shape[0])
-@numba.njit(nogil=True, parallel=True, locals={"node": numba.int32})
+@numba.njit(nogil=True, parallel=True, locals={"node": numba.int32}, cache=True)
def score_linked_tree(tree, neighbor_indices):
result = 0.0
n_nodes = len(tree.children)
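The sparse_dot_product helper imported above lives in pynndescent/sparse.py and is not shown in this diff; as orientation only, a rough equivalent of the sparse_mul-and-sum code it replaces:

    import numpy as np

    def sparse_dot_product_sketch(ind1, data1, ind2, data2):
        # Dot product of two sparse vectors given as sorted index/value arrays.
        result = 0.0
        i1 = i2 = 0
        while i1 < ind1.shape[0] and i2 < ind2.shape[0]:
            if ind1[i1] == ind2[i2]:
                result += data1[i1] * data2[i2]
                i1 += 1
                i2 += 1
            elif ind1[i1] < ind2[i2]:
                i1 += 1
            else:
                i2 += 1
        return result

    # e.g. sparse_dot_product_sketch(np.array([0, 2, 5]), np.array([1.0, 2.0, 3.0]),
    #                                np.array([2, 5]), np.array([4.0, 0.5]))  -> 9.5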
=====================================
pynndescent/sparse.py
=====================================
@@ -8,7 +8,11 @@ import numpy as np
import numba
from pynndescent.utils import norm, tau_rand
-from pynndescent.distances import kantorovich
+from pynndescent.distances import (
+ kantorovich,
+ jensen_shannon_divergence,
+ symmetric_kl_divergence,
+)
locale.setlocale(locale.LC_NUMERIC, "C")
@@ -16,14 +20,14 @@ FLOAT32_EPS = np.finfo(np.float32).eps
FLOAT32_MAX = np.finfo(np.float32).max
# Just reproduce a simpler version of numpy isclose (not numba supported yet)
- at numba.njit()
+ at numba.njit(cache=True)
def isclose(a, b, rtol=1.0e-5, atol=1.0e-8):
diff = np.abs(a - b)
return diff <= (atol + rtol * np.abs(b))
# Just reproduce a simpler version of numpy unique (not numba supported yet)
- at numba.njit()
+ at numba.njit(cache=True)
def arr_unique(arr):
aux = np.sort(arr)
flag = np.concatenate((np.ones(1, dtype=np.bool_), aux[1:] != aux[:-1]))
@@ -31,7 +35,7 @@ def arr_unique(arr):
# Just reproduce a simpler version of numpy union1d (not numba supported yet)
- at numba.njit()
+ at numba.njit(cache=True)
def arr_union(ar1, ar2):
if ar1.shape[0] == 0:
return ar2
@@ -43,12 +47,66 @@ def arr_union(ar1, ar2):
# Just reproduce a simpler version of numpy intersect1d (not numba supported
# yet)
- at numba.njit()
+ at numba.njit(cache=True)
def arr_intersect(ar1, ar2):
aux = np.concatenate((ar1, ar2))
aux.sort()
return aux[:-1][aux[1:] == aux[:-1]]
+# Several metrics below only need the size of the index intersection; compute it directly, assuming sorted index arrays for speed
+ at numba.njit(
+ [
+ "i4(i4[:],i4[:])",
+ numba.types.int32(
+ numba.types.Array(numba.types.int32, 1, "C", readonly=True),
+ numba.types.Array(numba.types.int32, 1, "C", readonly=True),
+ ),
+ ],
+ locals={
+ "i1": numba.uint16,
+ "i2": numba.uint16,
+ }
+)
+def fast_intersection_size(ar1, ar2):
+ if ar1.shape[0] == 0 or ar2.shape[0] == 0:
+ return 0
+
+ # NOTE: We assume arrays are sorted; if they are not this will break
+ i1 = 0
+ i2 = 0
+ limit1 = ar1.shape[0] - 1
+ limit2 = ar2.shape[0] - 1
+ j1 = ar1[i1]
+ j2 = ar2[i2]
+
+ result = 0
+
+ while True:
+ if j1 == j2:
+ result += 1
+ if i1 < limit1:
+ i1 += 1
+ j1 = ar1[i1]
+ else:
+ break
+
+ if i2 < limit2:
+ i2 += 1
+ j2 = ar2[i2]
+ else:
+ break
+
+ elif j1 < j2 and i1 < limit1:
+ i1 += 1
+ j1 = ar1[i1]
+ elif j2 < j1 and i2 < limit2:
+ i2 += 1
+ j2 = ar2[i2]
+ else:
+ break
+
+ return result
+
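The binary metrics further down no longer build union or intersection arrays; they only need the intersection size and derive the union size as len(ind1) + len(ind2) - intersection. A simplified pure-Python sketch of the same two-pointer merge (illustrative only):

    def intersection_size(ar1, ar2):
        # Both index lists are assumed sorted, as for CSR indices.
        i1 = i2 = count = 0
        while i1 < len(ar1) and i2 < len(ar2):
            if ar1[i1] == ar2[i2]:
                count += 1
                i1 += 1
                i2 += 1
            elif ar1[i1] < ar2[i2]:
                i1 += 1
            else:
                i2 += 1
        return count

    # intersection_size([0, 2, 5, 9], [2, 3, 9]) == 2, so the union size is
    # 4 + 3 - 2 = 5 without materialising the union array.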
@numba.njit(
[
@@ -62,7 +120,7 @@ def arr_intersect(ar1, ar2):
numba.types.Array(numba.types.float32, 1, "C", readonly=True),
numba.types.Array(numba.types.int32, 1, "C", readonly=True),
numba.types.Array(numba.types.float32, 1, "C", readonly=True),
- ),
+ )
],
fastmath=True,
locals={
@@ -74,6 +132,7 @@ def arr_intersect(ar1, ar2):
"j1": numba.types.int32,
"j2": numba.types.int32,
},
+ cache=True,
)
def sparse_sum(ind1, data1, ind2, data2):
result_size = ind1.shape[0] + ind2.shape[0]
@@ -138,11 +197,10 @@ def sparse_sum(ind1, data1, ind2, data2):
return result_ind, result_data
- at numba.njit()
+ at numba.njit(cache=True)
def sparse_diff(ind1, data1, ind2, data2):
return sparse_sum(ind1, data1, ind2, -data2)
-
@numba.njit(
[
# "Tuple((i4[::1],f4[::1]))(i4[::1],f4[::1],i4[::1],f4[::1])",
@@ -156,7 +214,7 @@ def sparse_diff(ind1, data1, ind2, data2):
numba.types.Array(numba.types.float32, 1, "C", readonly=True),
numba.types.Array(numba.types.int32, 1, "C", readonly=True),
numba.types.Array(numba.types.float32, 1, "C", readonly=True),
- ),
+ )
],
fastmath=True,
locals={
@@ -166,6 +224,7 @@ def sparse_diff(ind1, data1, ind2, data2):
"j1": numba.types.int32,
"j2": numba.types.int32,
},
+ cache=True,
)
def sparse_mul(ind1, data1, ind2, data2):
result_ind = numba.typed.List.empty_list(numba.types.int32)
@@ -193,8 +252,124 @@ def sparse_mul(ind1, data1, ind2, data2):
return result_ind, result_data
+ at numba.njit(
+ [
+ # "Tuple((i4[::1],f4[::1]))(i4[::1],f4[::1],i4[::1],f4[::1])",
+ numba.types.float32(
+ numba.types.Array(numba.types.int32, 1, "C", readonly=True),
+ numba.types.Array(numba.types.float32, 1, "C", readonly=True),
+ numba.types.Array(numba.types.int32, 1, "C", readonly=True),
+ numba.types.Array(numba.types.float32, 1, "C", readonly=True),
+ )
+ ],
+ fastmath=True,
+ locals={
+ "result": numba.types.float32,
+ "val": numba.types.float32,
+ "i1": numba.types.uint16,
+ "i2": numba.types.uint16,
+ "j1": numba.types.int32,
+ "j2": numba.types.int32,
+ },
+ cache=True,
+)
+def sparse_dot_product(ind1, data1, ind2, data2):
+ dim1 = ind1.shape[0]
+ dim2 = ind2.shape[0]
+
+ result = 0.0
+
+ i1 = 0
+ i2 = 0
+ j1 = ind1[i1]
+ j2 = ind2[i2]
+ # pass through both index lists
+ while True:
+ if j1 == j2:
+ val = data1[i1] * data2[i2]
+ result += val
+ i1 += 1
+ if i1 >= dim1:
+ return result
+ j1 = ind1[i1]
+ i2 += 1
+ if i2 >= dim2:
+ return result
+ j2 = ind2[i2]
+ elif j1 < j2:
+ i1 += 1
+ if i1 >= dim1:
+ return result
+ j1 = ind1[i1]
+ else:
+ i2 += 1
+ if i2 >= dim2:
+ return result
+ j2 = ind2[i2]
+
+ return result # unreachable
+
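This single merged pass replaces the removed pattern of materialising the sparse_mul output and summing it (see the hunks for sparse_select_side above and sparse_dot below). A quick dense cross-check of what the routine computes, using NumPy stand-ins rather than the patched function:

    import numpy as np

    ind1, data1 = np.array([1, 4, 6]), np.array([2.0, 3.0, 1.0])
    ind2, data2 = np.array([0, 4, 6, 9]), np.array([5.0, 7.0, 2.0, 8.0])

    dense1 = np.zeros(10)
    dense1[ind1] = data1
    dense2 = np.zeros(10)
    dense2[ind2] = data2
    # Only the shared indices 4 and 6 contribute: 3*7 + 1*2 == 23.
    assert np.dot(dense1, dense2) == 23.0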
+# Return dense vectors supported on the union of the non-zero valued indices
@numba.njit()
+def dense_union(ind1, data1, ind2, data2):
+ result_ind = arr_union(ind1, ind2)
+ result_data1 = np.zeros(result_ind.shape[0], dtype=np.float32)
+ result_data2 = np.zeros(result_ind.shape[0], dtype=np.float32)
+
+ i1 = 0
+ i2 = 0
+ nnz = 0
+
+ # pass through both index lists
+ while i1 < ind1.shape[0] and i2 < ind2.shape[0]:
+ j1 = ind1[i1]
+ j2 = ind2[i2]
+
+ if j1 == j2:
+ val = data1[i1] + data2[i2]
+ if val != 0:
+ result_data1[nnz] = data1[i1]
+ result_data2[nnz] = data2[i2]
+ nnz += 1
+ i1 += 1
+ i2 += 1
+ elif j1 < j2:
+ val = data1[i1]
+ if val != 0:
+ result_data1[nnz] = data1[i1]
+ nnz += 1
+ i1 += 1
+ else:
+ val = data2[i2]
+ if val != 0:
+ result_data2[nnz] = data2[i2]
+ nnz += 1
+ i2 += 1
+
+ # pass over the tails
+ while i1 < ind1.shape[0]:
+ val = data1[i1]
+ if val != 0:
+ result_data1[nnz] = data1[i1]
+ nnz += 1
+ i1 += 1
+
+ while i2 < ind2.shape[0]:
+ val = data2[i2]
+ if val != 0:
+ result_data2[nnz] = data2[i2]
+ nnz += 1
+ i2 += 1
+
+ # truncate to the correct length in case there were zeros
+ result_data1 = result_data1[:nnz]
+ result_data2 = result_data2[:nnz]
+
+ return result_data1, result_data2
+
+
+ at numba.njit(fastmath=True)
def sparse_euclidean(ind1, data1, ind2, data2):
_, aux_data = sparse_diff(ind1, data1, ind2, data2)
result = 0.0
@@ -315,8 +490,8 @@ def sparse_bray_curtis(ind1, data1, ind2, data2): # pragma: no cover
@numba.njit()
def sparse_jaccard(ind1, data1, ind2, data2):
- num_non_zero = arr_union(ind1, ind2).shape[0]
- num_equal = arr_intersect(ind1, ind2).shape[0]
+ num_equal = fast_intersection_size(ind1, ind2)
+ num_non_zero = ind1.shape[0] + ind2.shape[0] - num_equal
if num_non_zero == 0:
return 0.0
@@ -335,30 +510,31 @@ def sparse_jaccard(ind1, data1, ind2, data2):
),
],
fastmath=True,
- locals={
- "num_non_zero": numba.types.intp,
- "num_equal": numba.types.intp,
- },
+ locals={"num_non_zero": numba.types.intp, "num_equal": numba.types.intp},
)
def sparse_alternative_jaccard(ind1, data1, ind2, data2):
- num_non_zero = arr_union(ind1, ind2).shape[0]
- num_equal = arr_intersect(ind1, ind2).shape[0]
+ num_equal = fast_intersection_size(ind1, ind2)
+ num_non_zero = ind1.shape[0] + ind2.shape[0] - num_equal
if num_non_zero == 0:
return 0.0
+ elif num_equal == 0:
+ return FLOAT32_MAX
else:
return -np.log2(num_equal / num_non_zero)
+ # return (num_non_zero - num_equal) / num_equal
@numba.vectorize(fastmath=True)
def correct_alternative_jaccard(v):
return 1.0 - pow(2.0, -v)
+ # return v / (v + 1)
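For reference, the "alternative" form used during search is -log2(|intersection| / |union|), and correct_alternative_jaccard maps it back to the usual 1 - |intersection| / |union|. A small round-trip check (illustrative only):

    import numpy as np

    num_equal, num_non_zero = 3, 7               # |intersection|, |union|
    jaccard = 1.0 - num_equal / num_non_zero     # ordinary Jaccard distance
    alt = -np.log2(num_equal / num_non_zero)     # sparse_alternative_jaccard value
    assert np.isclose(1.0 - 2.0 ** (-alt), jaccard)   # correct_alternative_jaccard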
@numba.njit()
def sparse_matching(ind1, data1, ind2, data2, n_features):
- num_true_true = arr_intersect(ind1, ind2).shape[0]
- num_non_zero = arr_union(ind1, ind2).shape[0]
+ num_true_true = fast_intersection_size(ind1, ind2)
+ num_non_zero = ind1.shape[0] + ind2.shape[0] - num_true_true
num_not_equal = num_non_zero - num_true_true
return float(num_not_equal) / n_features
@@ -366,8 +542,8 @@ def sparse_matching(ind1, data1, ind2, data2, n_features):
@numba.njit()
def sparse_dice(ind1, data1, ind2, data2):
- num_true_true = arr_intersect(ind1, ind2).shape[0]
- num_non_zero = arr_union(ind1, ind2).shape[0]
+ num_true_true = fast_intersection_size(ind1, ind2)
+ num_non_zero = ind1.shape[0] + ind2.shape[0] - num_true_true
num_not_equal = num_non_zero - num_true_true
if num_not_equal == 0.0:
@@ -378,8 +554,8 @@ def sparse_dice(ind1, data1, ind2, data2):
@numba.njit()
def sparse_kulsinski(ind1, data1, ind2, data2, n_features):
- num_true_true = arr_intersect(ind1, ind2).shape[0]
- num_non_zero = arr_union(ind1, ind2).shape[0]
+ num_true_true = fast_intersection_size(ind1, ind2)
+ num_non_zero = ind1.shape[0] + ind2.shape[0] - num_true_true
num_not_equal = num_non_zero - num_true_true
if num_not_equal == 0:
@@ -392,8 +568,8 @@ def sparse_kulsinski(ind1, data1, ind2, data2, n_features):
@numba.njit()
def sparse_rogers_tanimoto(ind1, data1, ind2, data2, n_features):
- num_true_true = arr_intersect(ind1, ind2).shape[0]
- num_non_zero = arr_union(ind1, ind2).shape[0]
+ num_true_true = fast_intersection_size(ind1, ind2)
+ num_non_zero = ind1.shape[0] + ind2.shape[0] - num_true_true
num_not_equal = num_non_zero - num_true_true
return (2.0 * num_not_equal) / (n_features + num_not_equal)
@@ -404,7 +580,7 @@ def sparse_russellrao(ind1, data1, ind2, data2, n_features):
if ind1.shape[0] == ind2.shape[0] and np.all(ind1 == ind2):
return 0.0
- num_true_true = arr_intersect(ind1, ind2).shape[0]
+ num_true_true = fast_intersection_size(ind1, ind2)
if num_true_true == np.sum(data1 != 0) and num_true_true == np.sum(data2 != 0):
return 0.0
@@ -414,8 +590,8 @@ def sparse_russellrao(ind1, data1, ind2, data2, n_features):
@numba.njit()
def sparse_sokal_michener(ind1, data1, ind2, data2, n_features):
- num_true_true = arr_intersect(ind1, ind2).shape[0]
- num_non_zero = arr_union(ind1, ind2).shape[0]
+ num_true_true = fast_intersection_size(ind1, ind2)
+ num_non_zero = ind1.shape[0] + ind2.shape[0] - num_true_true
num_not_equal = num_non_zero - num_true_true
return (2.0 * num_not_equal) / (n_features + num_not_equal)
@@ -423,8 +599,8 @@ def sparse_sokal_michener(ind1, data1, ind2, data2, n_features):
@numba.njit()
def sparse_sokal_sneath(ind1, data1, ind2, data2):
- num_true_true = arr_intersect(ind1, ind2).shape[0]
- num_non_zero = arr_union(ind1, ind2).shape[0]
+ num_true_true = fast_intersection_size(ind1, ind2)
+ num_non_zero = ind1.shape[0] + ind2.shape[0] - num_true_true
num_not_equal = num_non_zero - num_true_true
if num_not_equal == 0.0:
@@ -491,11 +667,7 @@ def sparse_correct_alternative_cosine(d):
@numba.njit()
def sparse_dot(ind1, data1, ind2, data2):
- _, aux_data = sparse_mul(ind1, data1, ind2, data2)
- result = 0.0
-
- for val in aux_data:
- result += val
+ result = sparse_dot_product(ind1, data1, ind2, data2)
return 1.0 - result
@@ -505,16 +677,10 @@ def sparse_dot(ind1, data1, ind2, data2):
fastmath=True,
locals={
"result": numba.types.float32,
- "dim": numba.types.intp,
- "i": numba.types.uint16,
},
)
def sparse_alternative_dot(ind1, data1, ind2, data2):
- _, aux_data = sparse_mul(ind1, data1, ind2, data2)
- result = 0.0
- dim = len(aux_data)
- for i in range(dim):
- result += aux_data[i]
+ result = sparse_dot_product(ind1, data1, ind2, data2)
if result <= 0.0:
return FLOAT32_MAX
@@ -696,7 +862,83 @@ def sparse_kantorovich(ind1, data1, ind2, data2, ground_metric=dummy_ground_metr
return kantorovich(data1, data2, cost_matrix)
- at numba.njit(parallel=True)
+ at numba.njit()
+def sparse_wasserstein_1d(ind1, data1, ind2, data2, p=1):
+ result = 0.0
+ old_ind = 0
+ delta = 0.0
+ i1 = 0
+ i2 = 0
+ cdf1 = 0.0
+ cdf2 = 0.0
+ l1_norm_x = np.sum(data1)
+ l1_norm_y = np.sum(data2)
+
+ norm = lambda x, p: np.power(np.abs(x), p)
+
+ # pass through both index lists
+ while i1 < ind1.shape[0] and i2 < ind2.shape[0]:
+ j1 = ind1[i1]
+ j2 = ind2[i2]
+
+ if j1 == j2:
+ result += delta * (j1 - old_ind)
+ cdf1 += data1[i1] / l1_norm_x
+ cdf2 += data2[i2] / l1_norm_y
+ delta = norm(cdf1 - cdf2, p)
+ old_ind = j1
+ i1 += 1
+ i2 += 1
+ elif j1 < j2:
+ result += delta * (j1 - old_ind)
+ cdf1 += data1[i1] / l1_norm_x
+ delta = norm(cdf1 - cdf2, p)
+ old_ind = j1
+ i1 += 1
+ else:
+ result += delta * (j2 - old_ind)
+ cdf2 += data2[i2] / l1_norm_y
+ delta = norm(cdf1 - cdf2, p)
+ old_ind = j2
+ i2 += 1
+ # pass over the tails
+ while i1 < ind1.shape[0]:
+ j1 = ind1[i1]
+ result += delta * (j1 - old_ind)
+ cdf1 += data1[i1] / l1_norm_x
+ delta = norm(cdf1 - cdf2, p)
+ old_ind = j1
+ i1 += 1
+
+ while i2 < ind2.shape[0]:
+ j2 = ind2[i2]
+ result += delta * (j2 - old_ind)
+ cdf2 += data2[i2] / l1_norm_y
+ delta = norm(cdf1 - cdf2, p)
+ old_ind = j2
+ i2 += 1
+
+ return np.power(result, 1.0 / p)
+
+
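sparse_wasserstein_1d treats the two index/data pairs as histograms on the integer grid, L1-normalises them, and integrates |CDF1 - CDF2|^p across the gaps between non-zero indices. A dense sketch of the p=1 case (illustrative, not the patched function):

    import numpy as np

    h1 = np.array([0.0, 2.0, 0.0, 1.0, 1.0])
    h2 = np.array([1.0, 0.0, 0.0, 3.0, 0.0])
    cdf1 = np.cumsum(h1 / h1.sum())
    cdf2 = np.cumsum(h2 / h2.sum())
    w1 = np.abs(cdf1 - cdf2).sum()   # unit grid spacing, so no extra factor
    # w1 == 1.0 here; the sparse routine computes the same integral but only
    # steps through indices where either vector is non-zero, weighting each
    # CDF difference by the length of the gap it spans.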
+# Because of the EPS values, the need to renormalize after adding them (and, for
+# jensen_shannon, to average the two vectors), it is simplest to take the dense union
+# (dense vectors supported on the union of non-zero indices) and call the dense distance functions
+
+
+ at numba.njit()
+def sparse_jensen_shannon_divergence(ind1, data1, ind2, data2):
+ dense_data1, dense_data2 = dense_union(ind1, data1, ind2, data2)
+ return jensen_shannon_divergence(dense_data1, dense_data2)
+
+
+ at numba.njit()
+def sparse_symmetric_kl_divergence(ind1, data1, ind2, data2):
+ dense_data1, dense_data2 = dense_union(ind1, data1, ind2, data2)
+ return symmetric_kl_divergence(dense_data1, dense_data2)
+
+
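As the comment above notes, the Jensen-Shannon and symmetric KL wrappers align the two sparse vectors on the union of their non-zero indices and then reuse the dense divergences. A small NumPy stand-in for that alignment step (illustrative only, not the patched dense_union):

    import numpy as np

    ind1, data1 = np.array([0, 3]), np.array([0.5, 0.5])
    ind2, data2 = np.array([3, 7]), np.array([0.25, 0.75])

    union = np.union1d(ind1, ind2)                 # [0, 3, 7]
    aligned1 = np.zeros(union.shape[0])
    aligned1[np.searchsorted(union, ind1)] = data1
    aligned2 = np.zeros(union.shape[0])
    aligned2[np.searchsorted(union, ind2)] = data2
    # aligned1 == [0.5, 0.5, 0.0] and aligned2 == [0.0, 0.25, 0.75]: the pair of
    # dense vectors a dense divergence implementation would then consume.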
+ at numba.njit(parallel=True, cache=True)
def diversify(
indices,
distances,
@@ -751,7 +993,7 @@ def diversify(
return indices, distances
- at numba.njit(parallel=True)
+ at numba.njit(parallel=True, cache=True)
def diversify_csr(
graph_indptr,
graph_indices,
@@ -821,8 +1063,6 @@ sparse_named_distances = {
"minkowski": sparse_minkowski,
# Other distances
"canberra": sparse_canberra,
- "kantorovich": sparse_kantorovich,
- "wasserstein": sparse_kantorovich,
"braycurtis": sparse_bray_curtis,
# Binary distances
"hamming": sparse_hamming,
@@ -837,7 +1077,19 @@ sparse_named_distances = {
# Angular distances
"cosine": sparse_cosine,
"correlation": sparse_correlation,
+ # Distribution distances
+ "kantorovich": sparse_kantorovich,
+ "wasserstein": sparse_kantorovich,
+ "wasserstein_1d": sparse_wasserstein_1d,
+ "wasserstein-1d": sparse_wasserstein_1d,
+ "kantorovich-1d": sparse_wasserstein_1d,
+ "kantorovich-1d": sparse_wasserstein_1d,
"hellinger": sparse_hellinger,
+ "jensen-shannon": sparse_jensen_shannon_divergence,
+ "jensen_shannon": sparse_jensen_shannon_divergence,
+ "symmetric-kl": sparse_symmetric_kl_divergence,
+ "symmetric_kl": sparse_symmetric_kl_divergence,
+ "symmetric_kullback_liebler": sparse_symmetric_kl_divergence,
}
sparse_need_n_features = (
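A hedged usage sketch of the newly registered distribution metrics (assumes pynndescent 0.5.6; the data and parameter values here are illustrative, not taken from the patch):

    import numpy as np
    from scipy import sparse
    from sklearn.preprocessing import normalize
    from pynndescent import NNDescent

    counts = sparse.random(500, 50, density=0.3, format="csr", dtype=np.float32)
    counts = normalize(counts, norm="l1")   # rows behave like discrete distributions
    index = NNDescent(counts, metric="jensen_shannon", n_neighbors=10, random_state=42)
    neighbors, distances = index.query(counts[:5], k=5)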
=====================================
pynndescent/sparse_nndescent.py
=====================================
@@ -24,15 +24,8 @@ locale.setlocale(locale.LC_NUMERIC, "C")
EMPTY_GRAPH = make_heap(1, 1)
- at numba.njit(parallel=True)
-def generate_leaf_updates(
- leaf_block,
- dist_thresholds,
- inds,
- indptr,
- data,
- dist,
-):
+ at numba.njit(parallel=True, cache=True)
+def generate_leaf_updates(leaf_block, dist_thresholds, inds, indptr, data, dist):
updates = [[(-1, -1, np.inf)] for i in range(leaf_block.shape[0])]
@@ -60,13 +53,7 @@ def generate_leaf_updates(
return updates
- at numba.njit(
- locals={
- "d": numba.float32,
- "p": numba.int32,
- "q": numba.int32,
- }
-)
+ at numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=True)
def init_rp_tree(inds, indptr, data, dist, current_graph, leaf_array):
n_leaves = leaf_array.shape[0]
@@ -81,12 +68,7 @@ def init_rp_tree(inds, indptr, data, dist, current_graph, leaf_array):
dist_thresholds = current_graph[1][:, 0]
updates = generate_leaf_updates(
- leaf_block,
- dist_thresholds,
- inds,
- indptr,
- data,
- dist,
+ leaf_block, dist_thresholds, inds, indptr, data, dist
)
for j in range(len(updates)):
@@ -96,8 +78,6 @@ def init_rp_tree(inds, indptr, data, dist, current_graph, leaf_array):
if p == -1 or q == -1:
continue
- # heap_push(current_graph, p, d, q, 1)
- # heap_push(current_graph, q, d, p, 1)
checked_flagged_heap_push(
current_graph[1][p],
current_graph[0][p],
@@ -118,11 +98,8 @@ def init_rp_tree(inds, indptr, data, dist, current_graph, leaf_array):
@numba.njit(
fastmath=True,
- locals={
- "d": numba.float32,
- "i": numba.int32,
- "idx": numba.int32,
- },
+ locals={"d": numba.float32, "i": numba.int32, "idx": numba.int32},
+ cache=True,
)
def init_random(n_neighbors, inds, indptr, data, heap, dist, rng_state):
n_samples = indptr.shape[0] - 1
@@ -138,7 +115,6 @@ def init_random(n_neighbors, inds, indptr, data, heap, dist, rng_state):
to_data = data[indptr[i] : indptr[i + 1]]
d = dist(from_inds, from_data, to_inds, to_data)
- # heap_push(heap, i, d, idx, 1)
checked_flagged_heap_push(
heap[1][i], heap[0][i], heap[2][i], d, idx, np.uint8(1)
)
@@ -146,15 +122,9 @@ def init_random(n_neighbors, inds, indptr, data, heap, dist, rng_state):
return
- at numba.njit(parallel=True)
+ at numba.njit(parallel=True, cache=True)
def generate_graph_updates(
- new_candidate_block,
- old_candidate_block,
- dist_thresholds,
- inds,
- indptr,
- data,
- dist,
+ new_candidate_block, old_candidate_block, dist_thresholds, inds, indptr, data, dist
):
block_size = new_candidate_block.shape[0]
@@ -216,15 +186,14 @@ def nn_descent_internal_low_memory_parallel(
n_vertices = indptr.shape[0] - 1
block_size = 16384
n_blocks = n_vertices // block_size
+ n_threads = numba.get_num_threads()
for n in range(n_iters):
if verbose:
print("\t", n + 1, " / ", n_iters)
(new_candidate_neighbors, old_candidate_neighbors) = new_build_candidates(
- current_graph,
- max_candidates,
- rng_state,
+ current_graph, max_candidates, rng_state, n_threads
)
c = 0
@@ -246,7 +215,7 @@ def nn_descent_internal_low_memory_parallel(
dist,
)
- c += apply_graph_updates_low_memory(current_graph, updates)
+ c += apply_graph_updates_low_memory(current_graph, updates, n_threads)
if c <= delta * n_neighbors * n_vertices:
if verbose:
@@ -271,6 +240,7 @@ def nn_descent_internal_high_memory_parallel(
n_vertices = indptr.shape[0] - 1
block_size = 16384
n_blocks = n_vertices // block_size
+ n_threads = numba.get_num_threads()
in_graph = [
set(current_graph[0][i].astype(np.int64))
@@ -282,9 +252,7 @@ def nn_descent_internal_high_memory_parallel(
print("\t", n + 1, " / ", n_iters)
(new_candidate_neighbors, old_candidate_neighbors) = new_build_candidates(
- current_graph,
- max_candidates,
- rng_state,
+ current_graph, max_candidates, rng_state, n_threads
)
c = 0
=====================================
pynndescent/tests/conftest.py
=====================================
@@ -0,0 +1,63 @@
+import os
+import pytest
+import numpy as np
+from scipy import sparse
+
+# Provide the random seed as a fixture so tests can build
+# reproducible random states from it
+ at pytest.fixture
+def seed():
+ return 189212 # 0b101110001100011100
+
+
+np.random.seed(189212)
+
+
+ at pytest.fixture
+def spatial_data():
+ sp_data = np.random.randn(10, 20)
+ # Add some all zero graph_data for corner case test
+ sp_data = np.vstack([sp_data, np.zeros((2, 20))]).astype(np.float32, order="C")
+ return sp_data
+
+
+ at pytest.fixture
+def binary_data():
+ bin_data = np.random.choice(a=[False, True], size=(10, 20), p=[0.66, 1 - 0.66])
+ # Add some all zero graph_data for corner case test
+ bin_data = np.vstack([bin_data, np.zeros((2, 20), dtype="bool")])
+ return bin_data
+
+
+ at pytest.fixture
+def sparse_spatial_data(spatial_data, binary_data):
+ sp_sparse_data = sparse.csr_matrix(spatial_data * binary_data, dtype=np.float32)
+ sp_sparse_data.sort_indices()
+ return sp_sparse_data
+
+
+ at pytest.fixture
+def sparse_binary_data(binary_data):
+ bin_sparse_data = sparse.csr_matrix(binary_data)
+ bin_sparse_data.sort_indices()
+ return bin_sparse_data
+
+
+ at pytest.fixture
+def nn_data():
+ nndata = np.random.uniform(0, 1, size=(1000, 5))
+ # Add some all zero graph_data for corner case test
+ nndata = np.vstack([nndata, np.zeros((2, 5))])
+ return nndata
+
+
+ at pytest.fixture
+def sparse_nn_data():
+ return sparse.random(1000, 50, density=0.5, format="csr")
+
+
+ at pytest.fixture
+def cosine_hang_data():
+ this_dir = os.path.dirname(os.path.abspath(__file__))
+ data_path = os.path.join(this_dir, "test_data/cosine_hang.npy")
+ return np.load(data_path)
=====================================
pynndescent/tests/test_distances.py
=====================================
@@ -1,27 +1,30 @@
+import pytest
import numpy as np
from numpy.testing import assert_array_equal, assert_array_almost_equal
import pynndescent.distances as dist
import pynndescent.sparse as spdist
-from scipy import sparse, stats
+from scipy import stats
+from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import BallTree
-
-np.random.seed(42)
-spatial_data = np.random.randn(10, 20)
-spatial_data = np.vstack([spatial_data, np.zeros((2, 20))]).astype(
- np.float32, order="C"
-) # Add some all zero graph_data for corner case test
-binary_data = np.random.choice(a=[False, True], size=(10, 20), p=[0.66, 1 - 0.66])
-binary_data = np.vstack(
- [binary_data, np.zeros((2, 20), dtype="bool")]
-) # Add some all zero graph_data for corner case test
-sparse_spatial_data = sparse.csr_matrix(spatial_data * binary_data, dtype=np.float32)
-sparse_spatial_data.sort_indices()
-sparse_binary_data = sparse.csr_matrix(binary_data)
-sparse_binary_data.sort_indices()
-
-
-def spatial_check(metric):
+from sklearn.preprocessing import normalize
+
+
+ at pytest.mark.parametrize(
+ "metric",
+ [
+ "euclidean",
+ "manhattan",
+ "chebyshev",
+ "minkowski",
+ "hamming",
+ "canberra",
+ "braycurtis",
+ "cosine",
+ "correlation",
+ ],
+)
+def test_spatial_check(spatial_data, metric):
dist_matrix = pairwise_distances(spatial_data, metric=metric)
# scipy is bad sometimes
if metric == "braycurtis":
@@ -48,7 +51,21 @@ def spatial_check(metric):
)
-def binary_check(metric):
+ at pytest.mark.parametrize(
+ "metric",
+ [
+ "jaccard",
+ "matching",
+ "dice",
+ "kulsinski",
+ "rogerstanimoto",
+ "russellrao",
+ "sokalmichener",
+ "sokalsneath",
+ "yule",
+ ],
+)
+def test_binary_check(binary_data, metric):
dist_matrix = pairwise_distances(binary_data, metric=metric)
if metric in ("jaccard", "dice", "sokalsneath", "yule"):
dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
@@ -74,7 +91,21 @@ def binary_check(metric):
)
-def sparse_spatial_check(metric, decimal=6):
+ at pytest.mark.parametrize(
+ "metric",
+ [
+ "euclidean",
+ "manhattan",
+ "chebyshev",
+ "minkowski",
+ "hamming",
+ "canberra",
+ "cosine",
+ "braycurtis",
+ "correlation",
+ ],
+)
+def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
if metric in spdist.sparse_named_distances:
dist_matrix = pairwise_distances(
sparse_spatial_data.todense().astype(np.float32), metric=metric
@@ -127,10 +158,23 @@ def sparse_spatial_check(metric, decimal=6):
)
-def sparse_binary_check(metric):
+ at pytest.mark.parametrize(
+ "metric",
+ [
+ "jaccard",
+ "matching",
+ "dice",
+ "kulsinski",
+ "rogerstanimoto",
+ "russellrao",
+ "sokalmichener",
+ "sokalsneath",
+ ],
+)
+def test_sparse_binary_check(sparse_binary_data, metric):
if metric in spdist.sparse_named_distances:
dist_matrix = pairwise_distances(sparse_binary_data.todense(), metric=metric)
- if metric in ("jaccard", "dice", "sokalsneath", "yule"):
+ if metric in ("jaccard", "dice", "sokalsneath"):
dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
if metric in ("kulsinski", "russellrao"):
dist_matrix[np.where(~np.isfinite(dist_matrix))] = 1.0
@@ -178,147 +222,7 @@ def sparse_binary_check(metric):
)
-def test_euclidean():
- spatial_check("euclidean")
-
-
-def test_manhattan():
- spatial_check("manhattan")
-
-
-def test_chebyshev():
- spatial_check("chebyshev")
-
-
-def test_minkowski():
- spatial_check("minkowski")
-
-
-def test_hamming():
- spatial_check("hamming")
-
-
-def test_canberra():
- spatial_check("canberra")
-
-
-def test_braycurtis():
- spatial_check("braycurtis")
-
-
-def test_cosine():
- spatial_check("cosine")
-
-
-def test_correlation():
- spatial_check("correlation")
-
-
-def test_jaccard():
- binary_check("jaccard")
-
-
-def test_matching():
- binary_check("matching")
-
-
-def test_dice():
- binary_check("dice")
-
-
-def test_kulsinski():
- binary_check("kulsinski")
-
-
-def test_rogerstanimoto():
- binary_check("rogerstanimoto")
-
-
-def test_russellrao():
- binary_check("russellrao")
-
-
-def test_sokalmichener():
- binary_check("sokalmichener")
-
-
-def test_sokalsneath():
- binary_check("sokalsneath")
-
-
-def test_yule():
- binary_check("yule")
-
-
-def test_sparse_euclidean():
- sparse_spatial_check("euclidean")
-
-
-def test_sparse_manhattan():
- sparse_spatial_check("manhattan")
-
-
-def test_sparse_chebyshev():
- sparse_spatial_check("chebyshev")
-
-
-def test_sparse_minkowski():
- sparse_spatial_check("minkowski")
-
-
-def test_sparse_hamming():
- sparse_spatial_check("hamming")
-
-
-def test_sparse_canberra():
- sparse_spatial_check("canberra") # Be a little forgiving
-
-
-def test_sparse_cosine():
- sparse_spatial_check("cosine")
-
-
-def test_sparse_correlation():
- sparse_spatial_check("correlation")
-
-
-def test_sparse_jaccard():
- sparse_binary_check("jaccard")
-
-
-def test_sparse_matching():
- sparse_binary_check("matching")
-
-
-def test_sparse_dice():
- sparse_binary_check("dice")
-
-
-def test_sparse_kulsinski():
- sparse_binary_check("kulsinski")
-
-
-def test_sparse_rogerstanimoto():
- sparse_binary_check("rogerstanimoto")
-
-
-def test_sparse_russellrao():
- sparse_binary_check("russellrao")
-
-
-def test_sparse_sokalmichener():
- sparse_binary_check("sokalmichener")
-
-
-def test_sparse_sokalsneath():
- sparse_binary_check("sokalsneath")
-
-
-def test_sparse_braycurtis():
- sparse_spatial_check("braycurtis")
-
-
-def test_seuclidean():
+def test_seuclidean(spatial_data):
v = np.abs(np.random.randn(spatial_data.shape[1]))
dist_matrix = pairwise_distances(spatial_data, metric="seuclidean", V=v)
test_matrix = np.array(
@@ -337,7 +241,7 @@ def test_seuclidean():
)
-def test_weighted_minkowski():
+def test_weighted_minkowski(spatial_data):
v = np.abs(np.random.randn(spatial_data.shape[1]))
dist_matrix = pairwise_distances(spatial_data, metric="wminkowski", w=v, p=3)
test_matrix = np.array(
@@ -356,7 +260,7 @@ def test_weighted_minkowski():
)
-def test_mahalanobis():
+def test_mahalanobis(spatial_data):
v = np.cov(np.transpose(spatial_data))
dist_matrix = pairwise_distances(spatial_data, metric="mahalanobis", VI=v)
test_matrix = np.array(
@@ -375,7 +279,7 @@ def test_mahalanobis():
)
-def test_haversine():
+def test_haversine(spatial_data):
tree = BallTree(spatial_data[:, :2], metric="haversine")
dist_matrix, _ = tree.query(spatial_data[:, :2], k=spatial_data.shape[0])
test_matrix = np.array(
@@ -422,3 +326,69 @@ def test_alternative_distances():
corrected_alt_distance = correction(alt_dist(x, y))
assert np.isclose(true_distance, corrected_alt_distance)
+
+
+def test_jensen_shannon():
+ test_data = np.random.random(size=(10, 50))
+ test_data = normalize(test_data, norm="l1")
+ for i in range(test_data.shape[0]):
+ for j in range(i + 1, test_data.shape[0]):
+ m = (test_data[i] + test_data[j]) / 2.0
+ p = test_data[i]
+ q = test_data[j]
+ d1 = (
+ -np.sum(m * np.log(m))
+ + (np.sum(p * np.log(p)) + np.sum(q * np.log(q))) / 2.0
+ )
+ d2 = dist.jensen_shannon_divergence(p, q)
+ assert np.isclose(d1, d2, rtol=1e-4)
+
+
+def test_sparse_jensen_shannon():
+ test_data = np.random.random(size=(10, 100))
+ # sparsify
+ test_data[test_data <= 0.5] = 0.0
+ sparse_test_data = csr_matrix(test_data)
+ sparse_test_data = normalize(sparse_test_data, norm="l1")
+ test_data = normalize(test_data, norm="l1")
+
+ for i in range(test_data.shape[0]):
+ for j in range(i + 1, test_data.shape[0]):
+ m = (test_data[i] + test_data[j]) / 2.0
+ p = test_data[i]
+ q = test_data[j]
+ d1 = (
+ -np.sum(m[m > 0] * np.log(m[m > 0]))
+ + (
+ np.sum(p[p > 0] * np.log(p[p > 0]))
+ + np.sum(q[q > 0] * np.log(q[q > 0]))
+ )
+ / 2.0
+ )
+ d2 = spdist.sparse_jensen_shannon_divergence(
+ sparse_test_data[i].indices,
+ sparse_test_data[i].data,
+ sparse_test_data[j].indices,
+ sparse_test_data[j].data,
+ )
+ assert np.isclose(d1, d2, rtol=1e-3)
+
+
+ at pytest.mark.parametrize("p", [1.0, 2.0, 3.0, 0.5])
+def test_wasserstein_1d(p):
+ test_data = np.random.random(size=(10, 100))
+ # sparsify
+ test_data[test_data <= 0.5] = 0.0
+ sparse_test_data = csr_matrix(test_data)
+
+ for i in range(test_data.shape[0]):
+ for j in range(i + 1, test_data.shape[0]):
+ d1 = dist.wasserstein_1d(test_data[i], test_data[j], p)
+ d2 = spdist.sparse_wasserstein_1d(
+ sparse_test_data[i].indices,
+ sparse_test_data[i].data,
+ sparse_test_data[j].indices,
+ sparse_test_data[j].data,
+ p,
+ )
+ assert np.isclose(d1, d2)
=====================================
pynndescent/tests/test_pynndescent_.py
=====================================
@@ -1,44 +1,23 @@
import os
import io
import re
+import pytest
from contextlib import redirect_stdout
-from nose.tools import assert_greater_equal, assert_true, assert_equal
-from nose import SkipTest
-
import numpy as np
-from scipy import sparse
from sklearn.neighbors import KDTree
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize
import pickle
import joblib
+import scipy
from pynndescent import NNDescent, PyNNDescentTransformer
-np.random.seed(42)
-spatial_data = np.random.randn(10, 20)
-spatial_data = np.vstack(
- [spatial_data, np.zeros((2, 20))]
-) # Add some all zero graph_data for corner case test
-
-nn_data = np.random.uniform(0, 1, size=(1000, 5))
-nn_data = np.vstack(
- [nn_data, np.zeros((2, 5))]
-) # Add some all zero graph_data for corner case test
-# for_sparse_nn_data = np.random.uniform(0, 1, size=(1002, 500))
-# binary_nn_data = np.random.choice(a=[False, True], size=(1000, 500), p=[0.1, 1 - 0.1])
-# binary_nn_data = np.vstack(
-# [binary_nn_data, np.zeros((2, 500))]
-# ) # Add some all zero graph_data for corner case test
-# sparse_nn_data = sparse.csr_matrix(for_sparse_nn_data * binary_nn_data)
-sparse_nn_data = sparse.random(1000, 50, density=0.5, format="csr")
-# sparse_nn_data = sparse.csr_matrix(nn_data)
-
-
-def test_nn_descent_neighbor_accuracy():
+
+def test_nn_descent_neighbor_accuracy(nn_data, seed):
knn_indices, _ = NNDescent(
- nn_data, "euclidean", {}, 10, random_state=np.random
+ nn_data, "euclidean", {}, 10, random_state=np.random.RandomState(seed)
)._neighbor_graph
tree = KDTree(nn_data)
@@ -49,16 +28,14 @@ def test_nn_descent_neighbor_accuracy():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
percent_correct = num_correct / (nn_data.shape[0] * 10)
- assert_greater_equal(
- percent_correct,
- 0.98,
- "NN-descent did not get 99% " "accuracy on nearest neighbors",
+ assert percent_correct >= 0.98, (
+ "NN-descent did not get 99% " "accuracy on nearest neighbors"
)
-def test_angular_nn_descent_neighbor_accuracy():
+def test_angular_nn_descent_neighbor_accuracy(nn_data, seed):
knn_indices, _ = NNDescent(
- nn_data, "cosine", {}, 10, random_state=np.random
+ nn_data, "cosine", {}, 10, random_state=np.random.RandomState(seed)
)._neighbor_graph
angular_data = normalize(nn_data, norm="l2")
@@ -70,14 +47,16 @@ def test_angular_nn_descent_neighbor_accuracy():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
percent_correct = num_correct / (nn_data.shape[0] * 10)
- assert_greater_equal(
- percent_correct,
- 0.98,
- "NN-descent did not get 99% " "accuracy on nearest neighbors",
+ assert percent_correct >= 0.98, (
+ "NN-descent did not get 99% " "accuracy on nearest neighbors"
)
-def test_sparse_nn_descent_neighbor_accuracy():
+ at pytest.mark.skipif(
+ list(map(int, scipy.version.version.split("."))) < [1, 3, 0],
+ reason="requires scipy >= 1.3.0",
+)
+def test_sparse_nn_descent_neighbor_accuracy(sparse_nn_data, seed):
knn_indices, _ = NNDescent(
sparse_nn_data, "euclidean", n_neighbors=20, random_state=None
)._neighbor_graph
@@ -90,14 +69,16 @@ def test_sparse_nn_descent_neighbor_accuracy():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
percent_correct = num_correct / (sparse_nn_data.shape[0] * 10)
- assert_greater_equal(
- percent_correct,
- 0.85,
- "Sparse NN-descent did not get 95% " "accuracy on nearest neighbors",
+ assert percent_correct >= 0.85, (
+ "Sparse NN-descent did not get 95% " "accuracy on nearest neighbors"
)
-def test_sparse_angular_nn_descent_neighbor_accuracy():
+ at pytest.mark.skipif(
+ list(map(int, scipy.version.version.split("."))) < [1, 3, 0],
+ reason="requires scipy >= 1.3.0",
+)
+def test_sparse_angular_nn_descent_neighbor_accuracy(sparse_nn_data):
knn_indices, _ = NNDescent(
sparse_nn_data, "cosine", {}, 20, random_state=None
)._neighbor_graph
@@ -111,14 +92,12 @@ def test_sparse_angular_nn_descent_neighbor_accuracy():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
percent_correct = num_correct / (sparse_nn_data.shape[0] * 10)
- assert_greater_equal(
- percent_correct,
- 0.85,
- "Sparse angular NN-descent did not get 98% " "accuracy on nearest neighbors",
+ assert percent_correct >= 0.85, (
+ "Sparse angular NN-descent did not get 98% " "accuracy on nearest neighbors"
)
-def test_nn_descent_query_accuracy():
+def test_nn_descent_query_accuracy(nn_data):
nnd = NNDescent(nn_data[200:], "euclidean", n_neighbors=10, random_state=None)
knn_indices, _ = nnd.query(nn_data[:200], k=10, epsilon=0.2)
@@ -130,14 +109,12 @@ def test_nn_descent_query_accuracy():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
percent_correct = num_correct / (true_indices.shape[0] * 10)
- assert_greater_equal(
- percent_correct,
- 0.95,
- "NN-descent query did not get 95% " "accuracy on nearest neighbors",
+ assert percent_correct >= 0.95, (
+ "NN-descent query did not get 95% " "accuracy on nearest neighbors"
)
-def test_nn_descent_query_accuracy_angular():
+def test_nn_descent_query_accuracy_angular(nn_data):
nnd = NNDescent(nn_data[200:], "cosine", n_neighbors=30, random_state=None)
knn_indices, _ = nnd.query(nn_data[:200], k=10, epsilon=0.32)
@@ -149,14 +126,12 @@ def test_nn_descent_query_accuracy_angular():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
percent_correct = num_correct / (true_indices.shape[0] * 10)
- assert_greater_equal(
- percent_correct,
- 0.95,
- "NN-descent query did not get 95% " "accuracy on nearest neighbors",
+ assert percent_correct >= 0.95, (
+ "NN-descent query did not get 95% " "accuracy on nearest neighbors"
)
-def test_sparse_nn_descent_query_accuracy():
+def test_sparse_nn_descent_query_accuracy(sparse_nn_data):
nnd = NNDescent(
sparse_nn_data[200:], "euclidean", n_neighbors=15, random_state=None
)
@@ -170,14 +145,12 @@ def test_sparse_nn_descent_query_accuracy():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
percent_correct = num_correct / (true_indices.shape[0] * 10)
- assert_greater_equal(
- percent_correct,
- 0.95,
- "Sparse NN-descent query did not get 95% " "accuracy on nearest neighbors",
+ assert percent_correct >= 0.95, (
+ "Sparse NN-descent query did not get 95% " "accuracy on nearest neighbors"
)
-def test_sparse_nn_descent_query_accuracy_angular():
+def test_sparse_nn_descent_query_accuracy_angular(sparse_nn_data):
nnd = NNDescent(sparse_nn_data[200:], "cosine", n_neighbors=50, random_state=None)
knn_indices, _ = nnd.query(sparse_nn_data[:200], k=10, epsilon=0.36)
@@ -191,14 +164,12 @@ def test_sparse_nn_descent_query_accuracy_angular():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
percent_correct = num_correct / (true_indices.shape[0] * 10)
- assert_greater_equal(
- percent_correct,
- 0.95,
- "Sparse NN-descent query did not get 95% " "accuracy on nearest neighbors",
+ assert percent_correct >= 0.95, (
+ "Sparse NN-descent query did not get 95% " "accuracy on nearest neighbors"
)
-def test_transformer_equivalence():
+def test_transformer_equivalence(nn_data):
N_NEIGHBORS = 15
EPSILON = 0.15
train = nn_data[:400]
@@ -225,7 +196,7 @@ def test_transformer_equivalence():
assert np.allclose(Xt.data, dists_sorted.flat)
-def test_random_state_none():
+def test_random_state_none(nn_data, spatial_data):
knn_indices, _ = NNDescent(
nn_data, "euclidean", {}, 10, random_state=None
)._neighbor_graph
@@ -238,10 +209,8 @@ def test_random_state_none():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
percent_correct = num_correct / (spatial_data.shape[0] * 10)
- assert_greater_equal(
- percent_correct,
- 0.99,
- "NN-descent did not get 99% " "accuracy on nearest neighbors",
+ assert percent_correct >= 0.99, (
+ "NN-descent did not get 99% " "accuracy on nearest neighbors"
)
@@ -265,42 +234,44 @@ def test_deterministic():
# https://github.com/lmcinnes/umap/issues/99
# graph_data used is a cut-down version of that provided by @scharron
# It contains lots of all-zero vectors and some other duplicates
-def test_rp_trees_should_not_stack_overflow_with_duplicate_data():
- this_dir = os.path.dirname(os.path.abspath(__file__))
- data_path = os.path.join(this_dir, "test_data/cosine_hang.npy")
- data = np.load(data_path)
+def test_rp_trees_should_not_stack_overflow_with_duplicate_data(seed, cosine_hang_data):
n_neighbors = 10
knn_indices, _ = NNDescent(
- data, "cosine", {}, n_neighbors, random_state=np.random, n_trees=20
+ cosine_hang_data,
+ "cosine",
+ {},
+ n_neighbors,
+ random_state=np.random.RandomState(seed),
+ n_trees=20,
)._neighbor_graph
- for i in range(data.shape[0]):
- assert_equal(
- len(knn_indices[i]),
- len(np.unique(knn_indices[i])),
- "Duplicate graph_indices in knn graph",
- )
+ for i in range(cosine_hang_data.shape[0]):
+ assert len(knn_indices[i]) == len(
+ np.unique(knn_indices[i])
+ ), "Duplicate graph_indices in knn graph"
+
-def test_deduplicated_data_behaves_normally():
-    this_dir = os.path.dirname(os.path.abspath(__file__))
-    data_path = os.path.join(this_dir, "test_data/cosine_hang.npy")
-    data = np.unique(np.load(data_path), axis=0)
+def test_deduplicated_data_behaves_normally(seed, cosine_hang_data):
+    data = np.unique(cosine_hang_data, axis=0)
data = data[~np.all(data == 0, axis=1)]
data = data[:1000]
n_neighbors = 10
knn_indices, _ = NNDescent(
- data, "cosine", {}, n_neighbors, random_state=np.random, n_trees=20
+ data,
+ "cosine",
+ {},
+ n_neighbors,
+ random_state=np.random.RandomState(seed),
+ n_trees=20,
)._neighbor_graph
for i in range(data.shape[0]):
- assert_equal(
- len(knn_indices[i]),
- len(np.unique(knn_indices[i])),
- "Duplicate graph_indices in knn graph",
- )
+ assert len(knn_indices[i]) == len(
+ np.unique(knn_indices[i])
+ ), "Duplicate graph_indices in knn graph"
angular_data = normalize(data, norm="l2")
tree = KDTree(angular_data)
@@ -311,14 +282,12 @@ def test_deduplicated_data_behaves_normally():
num_correct += np.sum(np.in1d(true_indices[i], knn_indices[i]))
proportion_correct = num_correct / (data.shape[0] * n_neighbors)
- assert_greater_equal(
- proportion_correct,
- 0.95,
- "NN-descent did not get 95%" " accuracy on nearest neighbors",
+ assert proportion_correct >= 0.95, (
+ "NN-descent did not get 95%" " accuracy on nearest neighbors"
)
-def test_output_when_verbose_is_true():
+def test_output_when_verbose_is_true(spatial_data, seed):
out = io.StringIO()
with redirect_stdout(out):
_ = NNDescent(
@@ -326,17 +295,17 @@ def test_output_when_verbose_is_true():
metric="euclidean",
metric_kwds={},
n_neighbors=4,
- random_state=np.random,
+ random_state=np.random.RandomState(seed),
n_trees=5,
n_iters=2,
verbose=True,
)
output = out.getvalue()
- assert_true(re.match("^.*5 trees", output, re.DOTALL))
- assert_true(re.match("^.*2 iterations", output, re.DOTALL))
+ assert re.match("^.*5 trees", output, re.DOTALL)
+ assert re.match("^.*2 iterations", output, re.DOTALL)
-def test_no_output_when_verbose_is_false():
+def test_no_output_when_verbose_is_false(spatial_data, seed):
out = io.StringIO()
with redirect_stdout(out):
_ = NNDescent(
@@ -344,48 +313,48 @@ def test_no_output_when_verbose_is_false():
metric="euclidean",
metric_kwds={},
n_neighbors=4,
- random_state=np.random,
+ random_state=np.random.RandomState(seed),
n_trees=5,
n_iters=2,
verbose=False,
)
output = out.getvalue().strip()
- assert_equal(len(output), 0)
+ assert len(output) == 0
# same as the previous two test, but this time using the PyNNDescentTransformer
# interface
-def test_transformer_output_when_verbose_is_true():
+def test_transformer_output_when_verbose_is_true(spatial_data, seed):
out = io.StringIO()
with redirect_stdout(out):
_ = PyNNDescentTransformer(
n_neighbors=4,
metric="euclidean",
metric_kwds={},
- random_state=np.random,
+ random_state=np.random.RandomState(seed),
n_trees=5,
n_iters=2,
verbose=True,
).fit_transform(spatial_data)
output = out.getvalue()
- assert_true(re.match("^.*5 trees", output, re.DOTALL))
- assert_true(re.match("^.*2 iterations", output, re.DOTALL))
+ assert re.match("^.*5 trees", output, re.DOTALL)
+ assert re.match("^.*2 iterations", output, re.DOTALL)
-def test_transformer_output_when_verbose_is_false():
+def test_transformer_output_when_verbose_is_false(spatial_data, seed):
out = io.StringIO()
with redirect_stdout(out):
_ = PyNNDescentTransformer(
n_neighbors=4,
metric="standardised_euclidean",
metric_kwds={"sigma": np.ones(spatial_data.shape[1])},
- random_state=np.random,
+ random_state=np.random.RandomState(seed),
n_trees=5,
n_iters=2,
verbose=False,
).fit_transform(spatial_data)
output = out.getvalue().strip()
- assert_equal(len(output), 0)
+ assert len(output) == 0
def test_pickle_unpickle():
@@ -394,18 +363,13 @@ def test_pickle_unpickle():
x1 = seed.normal(0, 100, (1000, 50))
x2 = seed.normal(0, 100, (1000, 50))
- index1 = NNDescent(
- x1,
- "euclidean",
- {},
- 10,
- random_state=None,
- )
+ index1 = NNDescent(x1, "euclidean", {}, 10, random_state=None)
neighbors1, distances1 = index1.query(x2)
- pickle.dump(index1, open("test_tmp.pkl", "wb"))
- index2 = pickle.load(open("test_tmp.pkl", "rb"))
- os.remove("test_tmp.pkl")
+ mem_temp = io.BytesIO()
+ pickle.dump(index1, mem_temp)
+ mem_temp.seek(0)
+ index2 = pickle.load(mem_temp)
neighbors2, distances2 = index2.query(x2)
@@ -419,19 +383,13 @@ def test_compressed_pickle_unpickle():
x1 = seed.normal(0, 100, (1000, 50))
x2 = seed.normal(0, 100, (1000, 50))
- index1 = NNDescent(
- x1,
- "euclidean",
- {},
- 10,
- random_state=None,
- compressed=True,
- )
+ index1 = NNDescent(x1, "euclidean", {}, 10, random_state=None, compressed=True)
neighbors1, distances1 = index1.query(x2)
- pickle.dump(index1, open("test_tmp.pkl", "wb"))
- index2 = pickle.load(open("test_tmp.pkl", "rb"))
- os.remove("test_tmp.pkl")
+ mem_temp = io.BytesIO()
+ pickle.dump(index1, mem_temp)
+ mem_temp.seek(0)
+ index2 = pickle.load(mem_temp)
neighbors2, distances2 = index2.query(x2)
@@ -448,9 +406,10 @@ def test_transformer_pickle_unpickle():
index1 = PyNNDescentTransformer(n_neighbors=10).fit(x1)
result1 = index1.transform(x2)
- pickle.dump(index1, open("test_tmp.pkl", "wb"))
- index2 = pickle.load(open("test_tmp.pkl", "rb"))
- os.remove("test_tmp.pkl")
+ mem_temp = io.BytesIO()
+ pickle.dump(index1, mem_temp)
+ mem_temp.seek(0)
+ index2 = pickle.load(mem_temp)
result2 = index2.transform(x2)
@@ -464,18 +423,13 @@ def test_joblib_dump():
x1 = seed.normal(0, 100, (1000, 50))
x2 = seed.normal(0, 100, (1000, 50))
- index1 = NNDescent(
- x1,
- "euclidean",
- {},
- 10,
- random_state=None,
- )
+ index1 = NNDescent(x1, "euclidean", {}, 10, random_state=None)
neighbors1, distances1 = index1.query(x2)
- joblib.dump(index1, "test_tmp.dump")
- index2 = joblib.load("test_tmp.dump")
- os.remove("test_tmp.dump")
+ mem_temp = io.BytesIO()
+ joblib.dump(index1, mem_temp)
+ mem_temp.seek(0)
+ index2 = joblib.load(mem_temp)
neighbors2, distances2 = index2.query(x2)
=====================================
pynndescent/tests/test_rank.py
=====================================
@@ -1,5 +1,6 @@
+import pytest
import numpy as np
-from numpy.testing import assert_equal, assert_array_equal
+from numpy.testing import assert_array_equal
from pynndescent.distances import rankdata
@@ -94,51 +95,132 @@ def test_big_tie():
assert_array_equal(r, expected_rank * data, "test failed with n=%d" % n)
-# fmt: off
-_cases = (
- # values, method, expected
- (np.array([], np.float64), 'average', np.array([], np.float64)),
- (np.array([], np.float64), 'min', np.array([], np.float64)),
- (np.array([], np.float64), 'max', np.array([], np.float64)),
- (np.array([], np.float64), 'dense', np.array([], np.float64)),
- (np.array([], np.float64), 'ordinal', np.array([], np.float64)),
- #
- (np.array([100], np.float64), 'average', np.array([1.0], np.float64)),
- (np.array([100], np.float64), 'min', np.array([1.0], np.float64)),
- (np.array([100], np.float64), 'max', np.array([1.0], np.float64)),
- (np.array([100], np.float64), 'dense', np.array([1.0], np.float64)),
- (np.array([100], np.float64), 'ordinal', np.array([1.0], np.float64)),
- # #
- (np.array([100, 100, 100], np.float64), 'average', np.array([2.0, 2.0, 2.0], np.float64)),
- (np.array([100, 100, 100], np.float64), 'min', np.array([1.0, 1.0, 1.0], np.float64)),
- (np.array([100, 100, 100], np.float64), 'max', np.array([3.0, 3.0, 3.0], np.float64)),
- (np.array([100, 100, 100], np.float64), 'dense', np.array([1.0, 1.0, 1.0], np.float64)),
- (np.array([100, 100, 100], np.float64), 'ordinal', np.array([1.0, 2.0, 3.0], np.float64)),
- #
- (np.array([100, 300, 200], np.float64), 'average', np.array([1.0, 3.0, 2.0], np.float64)),
- (np.array([100, 300, 200], np.float64), 'min', np.array([1.0, 3.0, 2.0], np.float64)),
- (np.array([100, 300, 200], np.float64), 'max', np.array([1.0, 3.0, 2.0], np.float64)),
- (np.array([100, 300, 200], np.float64), 'dense', np.array([1.0, 3.0, 2.0], np.float64)),
- (np.array([100, 300, 200], np.float64), 'ordinal', np.array([1.0, 3.0, 2.0], np.float64)),
- #
- (np.array([100, 200, 300, 200], np.float64), 'average', np.array([1.0, 2.5, 4.0, 2.5], np.float64)),
- (np.array([100, 200, 300, 200], np.float64), 'min', np.array([1.0, 2.0, 4.0, 2.0], np.float64)),
- (np.array([100, 200, 300, 200], np.float64), 'max', np.array([1.0, 3.0, 4.0, 3.0], np.float64)),
- (np.array([100, 200, 300, 200], np.float64), 'dense', np.array([1.0, 2.0, 3.0, 2.0], np.float64)),
- (np.array([100, 200, 300, 200], np.float64), 'ordinal', np.array([1.0, 2.0, 4.0, 3.0], np.float64)),
- #
- (np.array([100, 200, 300, 200, 100], np.float64), 'average', np.array([1.5, 3.5, 5.0, 3.5, 1.5], np.float64)),
- (np.array([100, 200, 300, 200, 100], np.float64), 'min', np.array([1.0, 3.0, 5.0, 3.0, 1.0], np.float64)),
- (np.array([100, 200, 300, 200, 100], np.float64), 'max', np.array([2.0, 4.0, 5.0, 4.0, 2.0], np.float64)),
- (np.array([100, 200, 300, 200, 100], np.float64), 'dense', np.array([1.0, 2.0, 3.0, 2.0, 1.0], np.float64)),
- (np.array([100, 200, 300, 200, 100], np.float64), 'ordinal', np.array([1.0, 3.0, 5.0, 4.0, 2.0], np.float64)),
- #
- (np.array([10] * 30, np.float64), 'ordinal', np.arange(1.0, 31.0, dtype=np.float64)),
+ at pytest.mark.parametrize(
+ "values,method,expected",
+ [ # values, method, expected
+ (np.array([], np.float64), "average", np.array([], np.float64)),
+ (np.array([], np.float64), "min", np.array([], np.float64)),
+ (np.array([], np.float64), "max", np.array([], np.float64)),
+ (np.array([], np.float64), "dense", np.array([], np.float64)),
+ (np.array([], np.float64), "ordinal", np.array([], np.float64)),
+ #
+ (np.array([100], np.float64), "average", np.array([1.0], np.float64)),
+ (np.array([100], np.float64), "min", np.array([1.0], np.float64)),
+ (np.array([100], np.float64), "max", np.array([1.0], np.float64)),
+ (np.array([100], np.float64), "dense", np.array([1.0], np.float64)),
+ (np.array([100], np.float64), "ordinal", np.array([1.0], np.float64)),
+ # #
+ (
+ np.array([100, 100, 100], np.float64),
+ "average",
+ np.array([2.0, 2.0, 2.0], np.float64),
+ ),
+ (
+ np.array([100, 100, 100], np.float64),
+ "min",
+ np.array([1.0, 1.0, 1.0], np.float64),
+ ),
+ (
+ np.array([100, 100, 100], np.float64),
+ "max",
+ np.array([3.0, 3.0, 3.0], np.float64),
+ ),
+ (
+ np.array([100, 100, 100], np.float64),
+ "dense",
+ np.array([1.0, 1.0, 1.0], np.float64),
+ ),
+ (
+ np.array([100, 100, 100], np.float64),
+ "ordinal",
+ np.array([1.0, 2.0, 3.0], np.float64),
+ ),
+ #
+ (
+ np.array([100, 300, 200], np.float64),
+ "average",
+ np.array([1.0, 3.0, 2.0], np.float64),
+ ),
+ (
+ np.array([100, 300, 200], np.float64),
+ "min",
+ np.array([1.0, 3.0, 2.0], np.float64),
+ ),
+ (
+ np.array([100, 300, 200], np.float64),
+ "max",
+ np.array([1.0, 3.0, 2.0], np.float64),
+ ),
+ (
+ np.array([100, 300, 200], np.float64),
+ "dense",
+ np.array([1.0, 3.0, 2.0], np.float64),
+ ),
+ (
+ np.array([100, 300, 200], np.float64),
+ "ordinal",
+ np.array([1.0, 3.0, 2.0], np.float64),
+ ),
+ #
+ (
+ np.array([100, 200, 300, 200], np.float64),
+ "average",
+ np.array([1.0, 2.5, 4.0, 2.5], np.float64),
+ ),
+ (
+ np.array([100, 200, 300, 200], np.float64),
+ "min",
+ np.array([1.0, 2.0, 4.0, 2.0], np.float64),
+ ),
+ (
+ np.array([100, 200, 300, 200], np.float64),
+ "max",
+ np.array([1.0, 3.0, 4.0, 3.0], np.float64),
+ ),
+ (
+ np.array([100, 200, 300, 200], np.float64),
+ "dense",
+ np.array([1.0, 2.0, 3.0, 2.0], np.float64),
+ ),
+ (
+ np.array([100, 200, 300, 200], np.float64),
+ "ordinal",
+ np.array([1.0, 2.0, 4.0, 3.0], np.float64),
+ ),
+ #
+ (
+ np.array([100, 200, 300, 200, 100], np.float64),
+ "average",
+ np.array([1.5, 3.5, 5.0, 3.5, 1.5], np.float64),
+ ),
+ (
+ np.array([100, 200, 300, 200, 100], np.float64),
+ "min",
+ np.array([1.0, 3.0, 5.0, 3.0, 1.0], np.float64),
+ ),
+ (
+ np.array([100, 200, 300, 200, 100], np.float64),
+ "max",
+ np.array([2.0, 4.0, 5.0, 4.0, 2.0], np.float64),
+ ),
+ (
+ np.array([100, 200, 300, 200, 100], np.float64),
+ "dense",
+ np.array([1.0, 2.0, 3.0, 2.0, 1.0], np.float64),
+ ),
+ (
+ np.array([100, 200, 300, 200, 100], np.float64),
+ "ordinal",
+ np.array([1.0, 3.0, 5.0, 4.0, 2.0], np.float64),
+ ),
+ #
+ (
+ np.array([10] * 30, np.float64),
+ "ordinal",
+ np.arange(1.0, 31.0, dtype=np.float64),
+ ),
+ ],
)
-# fmt: on
-
-
-def test_cases():
- for values, method, expected in _cases:
- r = rankdata(values, method=method)
- assert_array_equal(r, expected)
+def test_cases(values, method, expected):
+ r = rankdata(values, method=method)
+ assert_array_equal(r, expected)
=====================================
pynndescent/utils.py
=====================================
@@ -6,17 +6,17 @@ import time
import numba
from numba.core import types
-from numba.experimental import structref
+import numba.experimental.structref as structref
import numpy as np
- at numba.njit("void(i8[:], i8)")
+ at numba.njit("void(i8[:], i8)", cache=True)
def seed(rng_state, seed):
"""Seed the random number generator with a given seed."""
rng_state.fill(seed + 0xFFFF)
- at numba.njit("i4(i8[:])")
+ at numba.njit("i4(i8[:])", cache=True)
def tau_rand_int(state):
"""A fast (pseudo)-random number generator.
@@ -42,7 +42,7 @@ def tau_rand_int(state):
return state[0] ^ state[1] ^ state[2]
- at numba.njit("f4(i8[:])")
+ at numba.njit("f4(i8[:])", cache=True)
def tau_rand(state):
"""A fast (pseudo)-random number generator for floats in the range [0,1]
@@ -69,9 +69,10 @@ def tau_rand(state):
locals={
"dim": numba.types.intp,
"i": numba.types.uint32,
- "result": numba.types.float32,
+ # "result": numba.types.float32, # This provides speed, but causes errors in corner cases
},
fastmath=True,
+ cache=True,
)
def norm(vec):
"""Compute the (standard l2) norm of a vector.
@@ -91,7 +92,7 @@ def norm(vec):
return np.sqrt(result)
- at numba.njit()
+ at numba.njit(cache=True)
def rejection_sample(n_samples, pool_size, rng_state):
"""Generate n_samples many integers from 0 to pool_size such that no
integer is selected twice. The duplication constraint is achieved via
@@ -147,31 +148,27 @@ class Heap(structref.StructRefProxy):
return Heap_get_flags(self)
- at numba.njit
+ at numba.njit(cache=True)
def Heap_get_flags(self):
return self.flags
- at numba.njit
+ at numba.njit(cache=True)
def Heap_get_distances(self):
return self.distances
- at numba.njit
+ at numba.njit(cache=True)
def Heap_get_indices(self):
return self.indices
-structref.define_proxy(
- Heap,
- HeapType,
- ["indices", "distances", "flags"],
-)
+structref.define_proxy(Heap, HeapType, ["indices", "distances", "flags"])
# Heap = namedtuple("Heap", ("indices", "distances", "flags"))
- at numba.njit()
+ at numba.njit(cache=True)
def make_heap(n_points, size):
"""Constructor for the numba enabled heap objects. The heaps are used
for approximate nearest neighbor search, maintaining a list of potential
@@ -202,197 +199,7 @@ def make_heap(n_points, size):
return result
- at numba.jit(
- locals={
- "indices": numba.types.int32[::1],
- "weights": numba.types.float32[::1],
- "is_new": numba.types.uint8[::1],
- "i": numba.types.uint16,
- "ic1": numba.types.uint16,
- "ic2": numba.types.uint16,
- "i_swap": numba.types.uint16,
- "heap_size": numba.types.uint16,
- }
-)
-def heap_push(heap, row, weight, index, flag):
- """Push a new element onto the heap. The heap stores potential neighbors
- for each graph_data point. The ``row`` parameter determines which graph_data point we
- are addressing, the ``weight`` determines the distance (for heap sorting),
- the ``index`` is the element to add, and the flag determines whether this
- is to be considered a new addition.
-
- Parameters
- ----------
- heap: ndarray generated by ``make_heap``
- The heap object to push into
-
- row: int
- Which actual heap within the heap object to push to
-
- weight: float
- The priority value of the element to push onto the heap
-
- index: int
- The actual value to be pushed
-
- flag: int
- Whether to flag the newly added element or not.
-
- Returns
- -------
- success: The number of new elements successfully pushed into the heap.
- """
- row = np.int32(row)
- weight = np.float32(weight)
- index = np.int32(index)
- flag = np.uint8(flag)
-
- indices = heap[0][row]
- weights = heap[1][row]
- is_new = heap[2][row]
-
- if weight >= weights[0]:
- return 0
-
- # break if we already have this element.
- for i in range(indices.shape[0]):
- if index == indices[i]:
- return 0
-
- # insert val at position zero
- weights[0] = weight
- indices[0] = index
- is_new[0] = flag
-
- # descend the heap, swapping values until the max heap criterion is met
- i = 0
- while True:
- ic1 = 2 * i + 1
- ic2 = ic1 + 1
-
- if ic1 >= indices.shape[0]:
- break
- elif ic2 >= indices.shape[0]:
- if weights[ic1] > weight:
- i_swap = ic1
- else:
- break
- elif weights[ic1] >= weights[ic2]:
- if weight < weights[ic1]:
- i_swap = ic1
- else:
- break
- else:
- if weight < weights[ic2]:
- i_swap = ic2
- else:
- break
-
- weights[i] = weights[i_swap]
- indices[i] = indices[i_swap]
- is_new[i] = is_new[i_swap]
-
- i = i_swap
-
- weights[i] = weight
- indices[i] = index
- is_new[i] = flag
-
- return 1
-
-
- at numba.jit(
- locals={
- "indices": numba.types.int32[::1],
- "weights": numba.types.float32[::1],
- "is_new": numba.types.uint8[::1],
- "i": numba.types.uint16,
- "ic1": numba.types.uint16,
- "ic2": numba.types.uint16,
- "i_swap": numba.types.uint16,
- "heap_size": numba.types.uint16,
- }
-)
-def unchecked_heap_push(heap, row, weight, index, flag):
- """Push a new element onto the heap. The heap stores potential neighbors
- for each graph_data point. The ``row`` parameter determines which graph_data point we
- are addressing, the ``weight`` determines the distance (for heap sorting),
- the ``index`` is the element to add, and the flag determines whether this
- is to be considered a new addition.
-
- Parameters
- ----------
- heap: ndarray generated by ``make_heap``
- The heap object to push into
-
- row: int
- Which actual heap within the heap object to push to
-
- weight: float
- The priority value of the element to push onto the heap
-
- index: int
- The actual value to be pushed
-
- flag: int
- Whether to flag the newly added element or not.
-
- Returns
- -------
- success: The number of new elements successfully pushed into the heap.
- """
- if weight >= heap[1][row, 0]:
- return 0
-
- indices = heap[0][row]
- weights = heap[1][row]
- is_new = heap[2][row]
-
- # insert val at position zero
- weights[0] = weight
- indices[0] = index
- is_new[0] = flag
-
- heap_size = indices.shape[0]
-
- # descend the heap, swapping values until the max heap criterion is met
- i = 0
- while True:
- ic1 = 2 * i + 1
- ic2 = ic1 + 1
-
- if ic1 >= heap_size:
- break
- elif ic2 >= heap_size:
- if weights[ic1] > weight:
- i_swap = ic1
- else:
- break
- elif weights[ic1] >= weights[ic2]:
- if weight < weights[ic1]:
- i_swap = ic1
- else:
- break
- else:
- if weight < weights[ic2]:
- i_swap = ic2
- else:
- break
-
- weights[i] = weights[i_swap]
- indices[i] = indices[i_swap]
- is_new[i] = is_new[i_swap]
-
- i = i_swap
-
- weights[i] = weight
- indices[i] = index
- is_new[i] = flag
-
- return 1
-
-
- at numba.njit()
+ at numba.njit(cache=True)
def siftdown(heap1, heap2, elt):
"""Restore the heap property for a heap with an out of place element
at position ``elt``. This works with a heap pair where heap1 carries
@@ -416,7 +223,7 @@ def siftdown(heap1, heap2, elt):
elt = swap
-@numba.njit()
+@numba.njit(cache=True)
def deheap_sort(heap):
"""Given an array of heaps (of graph_indices and weights), unpack the heap
out to give and array of sorted lists of graph_indices and weights by increasing
@@ -460,51 +267,47 @@ def deheap_sort(heap):
return indices.astype(np.int64), weights
-@numba.njit()
-def smallest_flagged(heap, row):
- """Search the heap for the smallest element that is
- still flagged.
-
- Parameters
- ----------
- heap: array of shape (3, n_samples, n_neighbors)
- The heaps to search
+# @numba.njit()
+# def smallest_flagged(heap, row):
+# """Search the heap for the smallest element that is
+# still flagged.
+#
+# Parameters
+# ----------
+# heap: array of shape (3, n_samples, n_neighbors)
+# The heaps to search
+#
+# row: int
+# Which of the heaps to search
+#
+# Returns
+# -------
+# index: int
+# The index of the smallest flagged element
+# of the ``row``th heap, or -1 if no flagged
+# elements remain in the heap.
+# """
+# ind = heap[0][row]
+# dist = heap[1][row]
+# flag = heap[2][row]
+#
+# min_dist = np.inf
+# result_index = -1
+#
+# for i in range(ind.shape[0]):
+# if flag[i] == 1 and dist[i] < min_dist:
+# min_dist = dist[i]
+# result_index = i
+#
+# if result_index >= 0:
+# flag[result_index] = 0.0
+# return int(ind[result_index])
+# else:
+# return -1
- row: int
- Which of the heaps to search
- Returns
- -------
- index: int
- The index of the smallest flagged element
- of the ``row``th heap, or -1 if no flagged
- elements remain in the heap.
- """
- ind = heap[0][row]
- dist = heap[1][row]
- flag = heap[2][row]
-
- min_dist = np.inf
- result_index = -1
-
- for i in range(ind.shape[0]):
- if flag[i] == 1 and dist[i] < min_dist:
- min_dist = dist[i]
- result_index = i
-
- if result_index >= 0:
- flag[result_index] = 0.0
- return int(ind[result_index])
- else:
- return -1
-
-
-@numba.njit(parallel=True, locals={"idx": numba.types.int64})
-def new_build_candidates(
- current_graph,
- max_candidates,
- rng_state,
-):
+@numba.njit(parallel=True, locals={"idx": numba.types.int64}, cache=True)
+def new_build_candidates(current_graph, max_candidates, rng_state, n_threads):
"""Build a heap of candidate neighbors for nearest neighbor descent. For
each vertex the candidate neighbors are any current neighbors, and any
vertices that have the vertex as one of their nearest neighbors.
@@ -541,8 +344,6 @@ def new_build_candidates(
(n_vertices, max_candidates), np.inf, dtype=np.float32
)
- n_threads = numba.get_num_threads()
-
for n in numba.prange(n_threads):
local_rng_state = rng_state + n
for i in range(n_vertices):
@@ -558,10 +359,7 @@ def new_build_candidates(
if isn:
if i % n_threads == n:
checked_heap_push(
- new_candidate_priority[i],
- new_candidate_indices[i],
- d,
- idx,
+ new_candidate_priority[i], new_candidate_indices[i], d, idx
)
if idx % n_threads == n:
checked_heap_push(
@@ -573,10 +371,7 @@ def new_build_candidates(
else:
if i % n_threads == n:
checked_heap_push(
- old_candidate_priority[i],
- old_candidate_indices[i],
- d,
- idx,
+ old_candidate_priority[i], old_candidate_indices[i], d, idx
)
if idx % n_threads == n:
checked_heap_push(
@@ -601,14 +396,14 @@ def new_build_candidates(
return new_candidate_indices, old_candidate_indices
- at numba.njit("b1(u1[::1],i4)")
+ at numba.njit("b1(u1[::1],i4)", cache=True)
def has_been_visited(table, candidate):
loc = candidate >> 3
mask = 1 << (candidate & 7)
return table[loc] & mask
- at numba.njit("void(u1[::1],i4)")
+ at numba.njit("void(u1[::1],i4)", cache=True)
def mark_visited(table, candidate):
loc = candidate >> 3
mask = 1 << (candidate & 7)
@@ -626,6 +421,7 @@ def mark_visited(table, candidate):
"ic2": numba.types.uint16,
"i_swap": numba.types.uint16,
},
+ cache=True,
)
def simple_heap_push(priorities, indices, p, n):
if p >= priorities[0]:
@@ -682,6 +478,7 @@ def simple_heap_push(priorities, indices, p, n):
"ic2": numba.types.uint16,
"i_swap": numba.types.uint16,
},
+ cache=True,
)
def checked_heap_push(priorities, indices, p, n):
if p >= priorities[0]:
@@ -743,65 +540,7 @@ def checked_heap_push(priorities, indices, p, n):
"ic2": numba.types.uint16,
"i_swap": numba.types.uint16,
},
-)
-def flagged_heap_push(priorities, indices, flags, p, n, f):
- if p >= priorities[0]:
- return 0
-
- size = priorities.shape[0]
-
- # insert val at position zero
- priorities[0] = p
- indices[0] = n
- flags[0] = f
-
- # descend the heap, swapping values until the max heap criterion is met
- i = 0
- while True:
- ic1 = 2 * i + 1
- ic2 = ic1 + 1
-
- if ic1 >= size:
- break
- elif ic2 >= size:
- if priorities[ic1] > p:
- i_swap = ic1
- else:
- break
- elif priorities[ic1] >= priorities[ic2]:
- if p < priorities[ic1]:
- i_swap = ic1
- else:
- break
- else:
- if p < priorities[ic2]:
- i_swap = ic2
- else:
- break
-
- priorities[i] = priorities[i_swap]
- indices[i] = indices[i_swap]
- flags[i] = flags[i_swap]
-
- i = i_swap
-
- priorities[i] = p
- indices[i] = n
- flags[i] = f
-
- return 1
-
-
-@numba.njit(
- "i4(f4[::1],i4[::1],u1[::1],f4,i4,u1)",
- fastmath=True,
- locals={
- "size": numba.types.intp,
- "i": numba.types.uint16,
- "ic1": numba.types.uint16,
- "ic2": numba.types.uint16,
- "i_swap": numba.types.uint16,
- },
+ cache=True,
)
def checked_flagged_heap_push(priorities, indices, flags, p, n, f):
if p >= priorities[0]:
@@ -867,14 +606,15 @@ def checked_flagged_heap_push(priorities, indices, flags, p, n, f):
"i": numba.uint32,
"j": numba.uint32,
},
+ cache=True,
)
-def apply_graph_updates_low_memory(current_graph, updates):
+def apply_graph_updates_low_memory(current_graph, updates, n_threads):
n_changes = 0
priorities = current_graph[1]
indices = current_graph[0]
flags = current_graph[2]
- n_threads = numba.get_num_threads()
+ # n_threads = numba.get_num_threads()
for n in numba.prange(n_threads):
for i in range(len(updates)):
@@ -885,33 +625,21 @@ def apply_graph_updates_low_memory(current_graph, updates):
continue
if p % n_threads == n:
- # added = heap_push(current_graph, p, d, q, 1)
added = checked_flagged_heap_push(
- priorities[p],
- indices[p],
- flags[p],
- d,
- q,
- 1,
+ priorities[p], indices[p], flags[p], d, q, 1
)
n_changes += added
if q % n_threads == n:
- # added = heap_push(current_graph, q, d, p, 1)
added = checked_flagged_heap_push(
- priorities[q],
- indices[q],
- flags[q],
- d,
- p,
- 1,
+ priorities[q], indices[q], flags[q], d, p, 1
)
n_changes += added
return n_changes
-@numba.njit(locals={"p": numba.types.int64, "q": numba.types.int64})
+@numba.njit(locals={"p": numba.types.int64, "q": numba.types.int64}, cache=True)
def apply_graph_updates_high_memory(current_graph, updates, in_graph):
n_changes = 0
@@ -928,8 +656,7 @@ def apply_graph_updates_high_memory(current_graph, updates, in_graph):
elif q in in_graph[p]:
pass
else:
- # added = unchecked_heap_push(current_graph, p, d, q, 1)
- added = flagged_heap_push(
+ added = checked_flagged_heap_push(
current_graph[1][p],
current_graph[0][p],
current_graph[2][p],
@@ -945,8 +672,7 @@ def apply_graph_updates_high_memory(current_graph, updates, in_graph):
if p == q or p in in_graph[q]:
pass
else:
- # added = unchecked_heap_push(current_graph, q, d, p, 1)
- added = flagged_heap_push(
+ added = checked_flagged_heap_push(
current_graph[1][p],
current_graph[0][p],
current_graph[2][p],
@@ -962,7 +688,7 @@ def apply_graph_updates_high_memory(current_graph, updates, in_graph):
return n_changes
-@numba.njit()
+@numba.njit(cache=True)
def initalize_heap_from_graph_indices(heap, graph_indices, data, metric):
for i in range(graph_indices.shape[0]):
@@ -970,12 +696,12 @@ def initalize_heap_from_graph_indices(heap, graph_indices, data, metric):
j = graph_indices[i, idx]
if j >= 0:
d = metric(data[i], data[j])
- flagged_heap_push(heap[1][i], heap[0][i], heap[2][i], d, j, 1)
+ checked_flagged_heap_push(heap[1][i], heap[0][i], heap[2][i], d, j, 1)
return heap
-@numba.njit(parallel=True)
+@numba.njit(parallel=True, cache=True)
def sparse_initalize_heap_from_graph_indices(
heap, graph_indices, data_indptr, data_indices, data_vals, metric
):
@@ -988,8 +714,7 @@ def sparse_initalize_heap_from_graph_indices(
ind2 = data_indices[data_indptr[j] : data_indptr[j + 1]]
data2 = data_vals[data_indptr[j] : data_indptr[j + 1]]
d = metric(ind1, data1, ind2, data2)
- # unchecked_heap_push(heap, i, d, j, 1)
- flagged_heap_push(heap[0][i], heap[1][i], heap[2][i], j, d, 1)
+ checked_flagged_heap_push(heap[1][i], heap[0][i], heap[2][i], d, j, 1)
return heap
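
For orientation, a minimal, self-contained sketch of the consolidated push path that the hunks above converge on: every insertion now goes through checked_flagged_heap_push on the per-row (priorities, indices, flags) views of the heap. The heap width, distance and neighbour id below are made-up illustration values, not anything taken from the package.

    import numpy as np
    from pynndescent.utils import checked_flagged_heap_push

    k = 5  # heap width, i.e. neighbours kept per point (illustrative)
    priorities = np.full(k, np.inf, dtype=np.float32)  # max-heap of distances, worst at slot 0
    indices = np.full(k, -1, dtype=np.int32)           # neighbour ids, -1 marks an empty slot
    flags = np.zeros(k, dtype=np.uint8)                # "new neighbour" flags

    # Push neighbour 42 at distance 0.3, flagged as new. The return value is 1
    # if the element was inserted, 0 if it was too far away or already present.
    added = checked_flagged_heap_push(
        priorities, indices, flags, np.float32(0.3), np.int32(42), np.uint8(1)
    )
    assert added == 1

In the same spirit, callers of new_build_candidates and apply_graph_updates_low_memory (the real call sites live in pynndescent_.py and are not part of this hunk) are now expected to resolve the thread count once with numba.get_num_threads() and pass it in as the new n_threads argument, rather than having the jitted bodies query it at run time.
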
=====================================
requirements.txt
=====================================
@@ -1,4 +1,5 @@
joblib
+numpy>=1.17
scikit-learn>=0.18
scipy>=1.0
numba>=0.51.2
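
Since the new line only states an explicit NumPy floor, installation is unchanged; a quick, hedged check that an existing environment already satisfies it (plain Python, nothing pynndescent-specific) is:

    import numpy
    # requirements.txt now asks for numpy>=1.17
    assert tuple(int(x) for x in numpy.__version__.split(".")[:2]) >= (1, 17)
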
=====================================
setup.py
=====================================
@@ -8,7 +8,7 @@ def readme():
configuration = {
"name": "pynndescent",
- "version": "0.5.2",
+ "version": "0.5.6",
"description": "Nearest Neighbor Descent",
"long_description": readme(),
"classifiers": [
View it on GitLab: https://salsa.debian.org/python-team/packages/python-pynndescent/-/compare/b22097cdca16e1e495d1ef891823075062f6a443...f37b4e890adb00bd0fd8ba94d48974686f809547