[med-svn] [Git][python-team/packages/python-pynndescent][master] 6 commits: routine-update: New upstream version
Andreas Tille (@tille)
gitlab at salsa.debian.org
Thu Dec 14 10:41:43 GMT 2023
Andreas Tille pushed to branch master at Debian Python Team / packages / python-pynndescent
Commits:
e5d9909b by Andreas Tille at 2023-12-14T11:10:56+01:00
routine-update: New upstream version
- - - - -
27116056 by Andreas Tille at 2023-12-14T11:10:57+01:00
New upstream version 0.5.11
- - - - -
fbb63112 by Andreas Tille at 2023-12-14T11:10:58+01:00
Update upstream source from tag 'upstream/0.5.11'
Update to upstream version '0.5.11'
with Debian dir 579d7377d8aabbd8e41308e87f5e81099845f18f
- - - - -
1478e110 by Andreas Tille at 2023-12-14T11:11:02+01:00
routine-update: Build-Depends: s/dh-python/dh-sequence-python3/
- - - - -
8d8c0161 by Andreas Tille at 2023-12-14T11:16:43+01:00
Update patches
- - - - -
b972a7f4 by Andreas Tille at 2023-12-14T11:20:14+01:00
TODO: bug #1057598 FTBFS: ImportError: Numba could not be imported.
- - - - -
21 changed files:
- PKG-INFO
- README.rst
- debian/changelog
- debian/control
- debian/patches/arm.patch
- − debian/patches/scipy.patch
- debian/patches/series
- debian/rules
- pynndescent.egg-info/PKG-INFO
- pynndescent.egg-info/SOURCES.txt
- pynndescent/distances.py
- pynndescent/pynndescent_.py
- pynndescent/rp_trees.py
- pynndescent/sparse_nndescent.py
- pynndescent/tests/conftest.py
- + pynndescent/tests/test_data/cosine_near_duplicates.npy
- + pynndescent/tests/test_data/pynndescent_bug_np.npz
- pynndescent/tests/test_distances.py
- pynndescent/tests/test_pynndescent_.py
- pynndescent/utils.py
- setup.py
Changes:
=====================================
PKG-INFO
=====================================
@@ -1,6 +1,6 @@
-Metadata-Version: 1.2
+Metadata-Version: 2.1
Name: pynndescent
-Version: 0.5.8
+Version: 0.5.11
Summary: Nearest Neighbor Descent
Home-page: http://github.com/lmcinnes/pynndescent
Author: Leland McInnes
@@ -8,219 +8,11 @@ Author-email: leland.mcinnes at gmail.com
Maintainer: Leland McInnes
Maintainer-email: leland.mcinnes at gmail.com
License: BSD
-Description: .. image:: doc/pynndescent_logo.png
- :width: 600
- :align: center
- :alt: PyNNDescent Logo
-
- .. image:: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_apis/build/status/lmcinnes.pynndescent?branchName=master
- :target: .. _build_status: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_build/latest?definitionId=2&branchName=master
- :alt: Azure Pipelines Build Status
- .. image:: https://img.shields.io/lgtm/alerts/g/lmcinnes/pynndescent.svg
- :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/alerts
- :alt: LGTM Alerts
- .. image:: https://img.shields.io/lgtm/grade/python/g/lmcinnes/pynndescent.svg
- :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/context:python
- :alt: LGTM Grade
- .. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
- :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
- :alt: Documentation Status
-
- ===========
- PyNNDescent
- ===========
-
- PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.
- It provides a python implementation of Nearest Neighbor
- Descent for k-neighbor-graph construction and approximate nearest neighbor
- search, as per the paper:
-
- Dong, Wei, Charikar Moses, and Kai Li.
- *"Efficient k-nearest neighbor graph construction for generic similarity
- measures."*
- Proceedings of the 20th international conference on World wide web. ACM, 2011.
-
- This library supplements that approach with the use of random projection trees for
- initialisation. This can be particularly useful for the metrics that are
- amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph
- diversification is also performed, pruning the longest edges of any triangles in the
- graph.
-
- Currently this library targets relatively high accuracy
- (80%-100% accuracy rate) approximate nearest neighbor searches.
-
- --------------------
- Why use PyNNDescent?
- --------------------
-
- PyNNDescent provides fast approximate nearest neighbor queries. The
- `ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
- solidly in the mix of top performing ANN libraries:
-
- **SIFT-128 Euclidean**
-
- .. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
- :alt: ANN benchmark performance for SIFT 128 dataset
-
- **NYTimes-256 Angular**
-
- .. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
- :alt: ANN benchmark performance for NYTimes 256 dataset
-
- While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
- and conda installable) with no platform or compilation issues, and is very flexible,
- supporting a wide variety of distance metrics by default:
-
- **Minkowski style metrics**
-
- - euclidean
- - manhattan
- - chebyshev
- - minkowski
-
- **Miscellaneous spatial metrics**
-
- - canberra
- - braycurtis
- - haversine
-
- **Normalized spatial metrics**
-
- - mahalanobis
- - wminkowski
- - seuclidean
-
- **Angular and correlation metrics**
-
- - cosine
- - dot
- - correlation
- - spearmanr
- - tsss
- - true_angular
-
- **Probability metrics**
-
- - hellinger
- - wasserstein
-
- **Metrics for binary data**
-
- - hamming
- - jaccard
- - dice
- - russelrao
- - kulsinski
- - rogerstanimoto
- - sokalmichener
- - sokalsneath
- - yule
-
- and also custom user defined distance metrics while still retaining performance.
-
- PyNNDescent also integrates well with Scikit-learn, including providing support
- for the KNeighborTransformer as a drop in replacement for algorithms
- that make use of nearest neighbor computations.
-
- ----------------------
- How to use PyNNDescent
- ----------------------
-
- PyNNDescent aims to have a very simple interface. It is similar to (but more
- limited than) KDTrees and BallTrees in ``sklearn``. In practice there are
- only two operations -- index construction, and querying an index for nearest
- neighbors.
-
- To build a new search index on some training data ``data`` you can do something
- like
-
- .. code:: python
-
- from pynndescent import NNDescent
- index = NNDescent(data)
-
- You can then use the index for searching (and can pickle it to disk if you
- wish). To search a pynndescent index for the 15 nearest neighbors of a test data
- set ``query_data`` you can do something like
-
- .. code:: python
-
- index.query(query_data, k=15)
-
- and that is pretty much all there is to it. You can find more details in the
- `documentation <https://pynndescent.readthedocs.org>`_.
-
- ----------
- Installing
- ----------
-
- PyNNDescent is designed to be easy to install being a pure python module with
- relatively light requirements:
-
- * numpy
- * scipy
- * scikit-learn >= 0.22
- * numba >= 0.51
-
- all of which should be pip or conda installable. The easiest way to install should be
- via conda:
-
- .. code:: bash
-
- conda install -c conda-forge pynndescent
-
- or via pip:
-
- .. code:: bash
-
- pip install pynndescent
-
- To manually install this package:
-
- .. code:: bash
-
- wget https://github.com/lmcinnes/pynndescent/archive/master.zip
- unzip master.zip
- rm master.zip
- cd pynndescent-master
- python setup.py install
-
- ----------------
- Help and Support
- ----------------
-
- This project is still young. The documentation is still growing. In the meantime please
- `open an issue <https://github.com/lmcinnes/pynndescent/issues/new>`_
- and I will try to provide any help and guidance that I can. Please also check
- the docstrings on the code, which provide some descriptions of the parameters.
-
- -------
- License
- -------
-
- The pynndescent package is 2-clause BSD licensed. Enjoy.
-
- ------------
- Contributing
- ------------
-
- Contributions are more than welcome! There are lots of opportunities
- for potential projects, so please get in touch if you would like to
- help out. Everything from code to notebooks to
- examples and documentation are all *equally valuable* so please don't feel
- you can't contribute. To contribute please `fork the project <https://github.com/lmcinnes/pynndescent/issues#fork-destination-box>`_ make your changes and
- submit a pull request. We will do our best to work through any issues with
- you and get your code merged into the main branch.
-
-
-
Keywords: nearest neighbor,knn,ANN
-Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved
-Classifier: Programming Language :: C
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
@@ -231,3 +23,204 @@ Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
+License-File: LICENSE
+
+.. image:: doc/pynndescent_logo.png
+ :width: 600
+ :align: center
+ :alt: PyNNDescent Logo
+
+.. image:: https://dev.azure.com/TutteInstitute/build-pipelines/_apis/build/status%2Flmcinnes.pynndescent?branchName=master
+ :target: https://dev.azure.com/TutteInstitute/build-pipelines/_build?definitionId=17
+ :alt: Azure Pipelines Build Status
+.. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
+ :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
+ :alt: Documentation Status
+
+===========
+PyNNDescent
+===========
+
+PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.
+It provides a python implementation of Nearest Neighbor
+Descent for k-neighbor-graph construction and approximate nearest neighbor
+search, as per the paper:
+
+Dong, Wei, Charikar Moses, and Kai Li.
+*"Efficient k-nearest neighbor graph construction for generic similarity
+measures."*
+Proceedings of the 20th international conference on World wide web. ACM, 2011.
+
+This library supplements that approach with the use of random projection trees for
+initialisation. This can be particularly useful for the metrics that are
+amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph
+diversification is also performed, pruning the longest edges of any triangles in the
+graph.
+
+Currently this library targets relatively high accuracy
+(80%-100% accuracy rate) approximate nearest neighbor searches.
+
+--------------------
+Why use PyNNDescent?
+--------------------
+
+PyNNDescent provides fast approximate nearest neighbor queries. The
+`ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
+solidly in the mix of top performing ANN libraries:
+
+**SIFT-128 Euclidean**
+
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
+ :alt: ANN benchmark performance for SIFT 128 dataset
+
+**NYTimes-256 Angular**
+
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
+ :alt: ANN benchmark performance for NYTimes 256 dataset
+
+While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
+and conda installable) with no platform or compilation issues, and is very flexible,
+supporting a wide variety of distance metrics by default:
+
+**Minkowski style metrics**
+
+- euclidean
+- manhattan
+- chebyshev
+- minkowski
+
+**Miscellaneous spatial metrics**
+
+- canberra
+- braycurtis
+- haversine
+
+**Normalized spatial metrics**
+
+- mahalanobis
+- wminkowski
+- seuclidean
+
+**Angular and correlation metrics**
+
+- cosine
+- dot
+- correlation
+- spearmanr
+- tsss
+- true_angular
+
+**Probability metrics**
+
+- hellinger
+- wasserstein
+
+**Metrics for binary data**
+
+- hamming
+- jaccard
+- dice
+- russelrao
+- kulsinski
+- rogerstanimoto
+- sokalmichener
+- sokalsneath
+- yule
+
+and also custom user defined distance metrics while still retaining performance.
+
+PyNNDescent also integrates well with Scikit-learn, including providing support
+for the KNeighborTransformer as a drop in replacement for algorithms
+that make use of nearest neighbor computations.
+
+----------------------
+How to use PyNNDescent
+----------------------
+
+PyNNDescent aims to have a very simple interface. It is similar to (but more
+limited than) KDTrees and BallTrees in ``sklearn``. In practice there are
+only two operations -- index construction, and querying an index for nearest
+neighbors.
+
+To build a new search index on some training data ``data`` you can do something
+like
+
+.. code:: python
+
+ from pynndescent import NNDescent
+ index = NNDescent(data)
+
+You can then use the index for searching (and can pickle it to disk if you
+wish). To search a pynndescent index for the 15 nearest neighbors of a test data
+set ``query_data`` you can do something like
+
+.. code:: python
+
+ index.query(query_data, k=15)
+
+and that is pretty much all there is to it. You can find more details in the
+`documentation <https://pynndescent.readthedocs.org>`_.
+
+----------
+Installing
+----------
+
+PyNNDescent is designed to be easy to install being a pure python module with
+relatively light requirements:
+
+* numpy
+* scipy
+* scikit-learn >= 0.22
+* numba >= 0.51
+
+all of which should be pip or conda installable. The easiest way to install should be
+via conda:
+
+.. code:: bash
+
+ conda install -c conda-forge pynndescent
+
+or via pip:
+
+.. code:: bash
+
+ pip install pynndescent
+
+To manually install this package:
+
+.. code:: bash
+
+ wget https://github.com/lmcinnes/pynndescent/archive/master.zip
+ unzip master.zip
+ rm master.zip
+ cd pynndescent-master
+ python setup.py install
+
+----------------
+Help and Support
+----------------
+
+This project is still young. The documentation is still growing. In the meantime please
+`open an issue <https://github.com/lmcinnes/pynndescent/issues/new>`_
+and I will try to provide any help and guidance that I can. Please also check
+the docstrings on the code, which provide some descriptions of the parameters.
+
+-------
+License
+-------
+
+The pynndescent package is 2-clause BSD licensed. Enjoy.
+
+------------
+Contributing
+------------
+
+Contributions are more than welcome! There are lots of opportunities
+for potential projects, so please get in touch if you would like to
+help out. Everything from code to notebooks to
+examples and documentation are all *equally valuable* so please don't feel
+you can't contribute. To contribute please `fork the project <https://github.com/lmcinnes/pynndescent/issues#fork-destination-box>`_ make your changes and
+submit a pull request. We will do our best to work through any issues with
+you and get your code merged into the main branch.
+
+
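A note on the PKG-INFO hunks above: the bump from Metadata-Version 1.2 to 2.1 is what moves the long description out of the indented Description: header field and into the message body at the end of the file, alongside the new License-File: field. A minimal inspection sketch, assuming a PKG-INFO from the unpacked 0.5.11 sdist:

.. code:: python

    # PKG-INFO is RFC 822-style, so the stdlib email parser can read it;
    # under Metadata-Version 2.1 the long description is the message body
    # rather than an indented "Description:" header field.
    from email.parser import Parser

    with open("PKG-INFO") as f:  # assumed path: root of the unpacked sdist
        meta = Parser().parse(f)

    print(meta["Metadata-Version"])  # expected: 2.1
    print(meta["License-File"])      # expected: LICENSE
    print(meta.get_payload()[:60])   # start of the long description body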
=====================================
README.rst
=====================================
@@ -3,15 +3,9 @@
:align: center
:alt: PyNNDescent Logo
-.. image:: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_apis/build/status/lmcinnes.pynndescent?branchName=master
- :target: .. _build_status: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_build/latest?definitionId=2&branchName=master
+.. image:: https://dev.azure.com/TutteInstitute/build-pipelines/_apis/build/status%2Flmcinnes.pynndescent?branchName=master
+ :target: https://dev.azure.com/TutteInstitute/build-pipelines/_build?definitionId=17
:alt: Azure Pipelines Build Status
-.. image:: https://img.shields.io/lgtm/alerts/g/lmcinnes/pynndescent.svg
- :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/alerts
- :alt: LGTM Alerts
-.. image:: https://img.shields.io/lgtm/grade/python/g/lmcinnes/pynndescent.svg
- :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/context:python
- :alt: LGTM Grade
.. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
:target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status
=====================================
debian/changelog
=====================================
@@ -1,3 +1,11 @@
+python-pynndescent (0.5.11-1) UNRELEASED; urgency=medium
+
+ * New upstream version
+ * Build-Depends: s/dh-python/dh-sequence-python3/ (routine-update)
+ TODO: bug #1057598 FTBFS: ImportError: Numba could not be imported.
+
+ -- Andreas Tille <tille at debian.org> Thu, 14 Dec 2023 11:10:56 +0100
+
python-pynndescent (0.5.8-2) unstable; urgency=medium
* Team Upload.
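As context for the TODO above, a hedged reproduction sketch for bug #1057598: the package's test stage imports pynndescent, which pulls in numba, and when numba itself fails to initialise, its own error message is the one quoted in the bug title.

.. code:: python

    # Hedged reproduction sketch for bug #1057598 (FTBFS): importing
    # pynndescent imports numba; if numba cannot initialise, its
    # "ImportError: Numba could not be imported." propagates and the
    # build's test stage fails.
    try:
        import pynndescent  # noqa: F401
    except ImportError as exc:
        print(f"FTBFS reproduced: {exc}")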
=====================================
debian/control
=====================================
@@ -4,7 +4,7 @@ Uploaders: Andreas Tille <tille at debian.org>
Section: python
Priority: optional
Build-Depends: debhelper-compat (= 13),
- dh-python,
+ dh-sequence-python3,
python3-setuptools,
python3-all,
python3-joblib <!nocheck>,
=====================================
debian/patches/arm.patch
=====================================
@@ -3,7 +3,7 @@ Author: Nilesh Patra <nilesh at debian.org>
Last-Update: 2023-02-12
--- a/pynndescent/tests/test_distances.py
+++ b/pynndescent/tests/test_distances.py
-@@ -9,6 +9,7 @@
+@@ -9,6 +9,7 @@ from scipy.version import full_version a
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import BallTree
from sklearn.preprocessing import normalize
@@ -11,7 +11,7 @@ Last-Update: 2023-02-12
@pytest.mark.parametrize(
-@@ -107,6 +108,9 @@
+@@ -106,6 +107,9 @@ def test_binary_check(binary_data, metri
],
)
def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
@@ -21,7 +21,7 @@ Last-Update: 2023-02-12
if metric in spdist.sparse_named_distances:
dist_matrix = pairwise_distances(
np.asarray(sparse_spatial_data.todense()).astype(np.float32), metric=metric
-@@ -333,6 +337,9 @@
+@@ -331,6 +335,9 @@ def test_alternative_distances():
def test_jensen_shannon():
@@ -31,7 +31,7 @@ Last-Update: 2023-02-12
test_data = np.random.random(size=(10, 50))
test_data = normalize(test_data, norm="l1")
for i in range(test_data.shape[0]):
-@@ -349,6 +356,9 @@
+@@ -347,6 +354,9 @@ def test_jensen_shannon():
def test_sparse_jensen_shannon():
@@ -43,7 +43,7 @@ Last-Update: 2023-02-12
test_data[test_data <= 0.5] = 0.0
--- a/pynndescent/tests/test_pynndescent_.py
+++ b/pynndescent/tests/test_pynndescent_.py
-@@ -11,9 +11,13 @@
+@@ -11,9 +11,13 @@ from sklearn.preprocessing import normal
import pickle
import joblib
import scipy
=====================================
debian/patches/scipy.patch deleted
=====================================
@@ -1,29 +0,0 @@
-From 00444be2107b71169b853847e7b334623c58a4e3 Mon Sep 17 00:00:00 2001
-From: Leland McInnes <leland.mcinnes at gmail.com>
-Date: Tue, 3 Jan 2023 10:53:15 -0500
-Subject: [PATCH] Update tests to fix #207
-
----
- pynndescent/tests/test_distances.py | 2 +-
- 1 file changed, 1 insertion(+), 1 deletion(-)
-
---- a/pynndescent/tests/test_distances.py
-+++ b/pynndescent/tests/test_distances.py
-@@ -109,7 +109,7 @@
- def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
- if metric in spdist.sparse_named_distances:
- dist_matrix = pairwise_distances(
-- sparse_spatial_data.todense().astype(np.float32), metric=metric
-+ np.asarray(sparse_spatial_data.todense()).astype(np.float32), metric=metric
- )
- if metric in ("braycurtis", "dice", "sokalsneath", "yule"):
- dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
-@@ -174,7 +174,7 @@
- )
- def test_sparse_binary_check(sparse_binary_data, metric):
- if metric in spdist.sparse_named_distances:
-- dist_matrix = pairwise_distances(sparse_binary_data.todense(), metric=metric)
-+ dist_matrix = pairwise_distances(np.asarray(sparse_binary_data.todense()), metric=metric)
- if metric in ("jaccard", "dice", "sokalsneath"):
- dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
- if metric in ("kulsinski", "russellrao"):
=====================================
debian/patches/series
=====================================
@@ -1,2 +1 @@
-scipy.patch
arm.patch
=====================================
debian/rules
=====================================
@@ -4,7 +4,7 @@ export PYBUILD_NAME=pynndescent
export PYTHONPATH=$(CURDIR)
%:
- dh $@ --with python3 --buildsystem=pybuild
+ dh $@ --buildsystem=pybuild
override_dh_auto_test:
ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS)))
=====================================
pynndescent.egg-info/PKG-INFO
=====================================
@@ -1,6 +1,6 @@
-Metadata-Version: 1.2
+Metadata-Version: 2.1
Name: pynndescent
-Version: 0.5.8
+Version: 0.5.11
Summary: Nearest Neighbor Descent
Home-page: http://github.com/lmcinnes/pynndescent
Author: Leland McInnes
@@ -8,219 +8,11 @@ Author-email: leland.mcinnes at gmail.com
Maintainer: Leland McInnes
Maintainer-email: leland.mcinnes at gmail.com
License: BSD
-Description: .. image:: doc/pynndescent_logo.png
- :width: 600
- :align: center
- :alt: PyNNDescent Logo
-
- .. image:: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_apis/build/status/lmcinnes.pynndescent?branchName=master
- :target: .. _build_status: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_build/latest?definitionId=2&branchName=master
- :alt: Azure Pipelines Build Status
- .. image:: https://img.shields.io/lgtm/alerts/g/lmcinnes/pynndescent.svg
- :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/alerts
- :alt: LGTM Alerts
- .. image:: https://img.shields.io/lgtm/grade/python/g/lmcinnes/pynndescent.svg
- :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/context:python
- :alt: LGTM Grade
- .. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
- :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
- :alt: Documentation Status
-
- ===========
- PyNNDescent
- ===========
-
- PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.
- It provides a python implementation of Nearest Neighbor
- Descent for k-neighbor-graph construction and approximate nearest neighbor
- search, as per the paper:
-
- Dong, Wei, Charikar Moses, and Kai Li.
- *"Efficient k-nearest neighbor graph construction for generic similarity
- measures."*
- Proceedings of the 20th international conference on World wide web. ACM, 2011.
-
- This library supplements that approach with the use of random projection trees for
- initialisation. This can be particularly useful for the metrics that are
- amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph
- diversification is also performed, pruning the longest edges of any triangles in the
- graph.
-
- Currently this library targets relatively high accuracy
- (80%-100% accuracy rate) approximate nearest neighbor searches.
-
- --------------------
- Why use PyNNDescent?
- --------------------
-
- PyNNDescent provides fast approximate nearest neighbor queries. The
- `ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
- solidly in the mix of top performing ANN libraries:
-
- **SIFT-128 Euclidean**
-
- .. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
- :alt: ANN benchmark performance for SIFT 128 dataset
-
- **NYTimes-256 Angular**
-
- .. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
- :alt: ANN benchmark performance for NYTimes 256 dataset
-
- While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
- and conda installable) with no platform or compilation issues, and is very flexible,
- supporting a wide variety of distance metrics by default:
-
- **Minkowski style metrics**
-
- - euclidean
- - manhattan
- - chebyshev
- - minkowski
-
- **Miscellaneous spatial metrics**
-
- - canberra
- - braycurtis
- - haversine
-
- **Normalized spatial metrics**
-
- - mahalanobis
- - wminkowski
- - seuclidean
-
- **Angular and correlation metrics**
-
- - cosine
- - dot
- - correlation
- - spearmanr
- - tsss
- - true_angular
-
- **Probability metrics**
-
- - hellinger
- - wasserstein
-
- **Metrics for binary data**
-
- - hamming
- - jaccard
- - dice
- - russelrao
- - kulsinski
- - rogerstanimoto
- - sokalmichener
- - sokalsneath
- - yule
-
- and also custom user defined distance metrics while still retaining performance.
-
- PyNNDescent also integrates well with Scikit-learn, including providing support
- for the KNeighborTransformer as a drop in replacement for algorithms
- that make use of nearest neighbor computations.
-
- ----------------------
- How to use PyNNDescent
- ----------------------
-
- PyNNDescent aims to have a very simple interface. It is similar to (but more
- limited than) KDTrees and BallTrees in ``sklearn``. In practice there are
- only two operations -- index construction, and querying an index for nearest
- neighbors.
-
- To build a new search index on some training data ``data`` you can do something
- like
-
- .. code:: python
-
- from pynndescent import NNDescent
- index = NNDescent(data)
-
- You can then use the index for searching (and can pickle it to disk if you
- wish). To search a pynndescent index for the 15 nearest neighbors of a test data
- set ``query_data`` you can do something like
-
- .. code:: python
-
- index.query(query_data, k=15)
-
- and that is pretty much all there is to it. You can find more details in the
- `documentation <https://pynndescent.readthedocs.org>`_.
-
- ----------
- Installing
- ----------
-
- PyNNDescent is designed to be easy to install being a pure python module with
- relatively light requirements:
-
- * numpy
- * scipy
- * scikit-learn >= 0.22
- * numba >= 0.51
-
- all of which should be pip or conda installable. The easiest way to install should be
- via conda:
-
- .. code:: bash
-
- conda install -c conda-forge pynndescent
-
- or via pip:
-
- .. code:: bash
-
- pip install pynndescent
-
- To manually install this package:
-
- .. code:: bash
-
- wget https://github.com/lmcinnes/pynndescent/archive/master.zip
- unzip master.zip
- rm master.zip
- cd pynndescent-master
- python setup.py install
-
- ----------------
- Help and Support
- ----------------
-
- This project is still young. The documentation is still growing. In the meantime please
- `open an issue <https://github.com/lmcinnes/pynndescent/issues/new>`_
- and I will try to provide any help and guidance that I can. Please also check
- the docstrings on the code, which provide some descriptions of the parameters.
-
- -------
- License
- -------
-
- The pynndescent package is 2-clause BSD licensed. Enjoy.
-
- ------------
- Contributing
- ------------
-
- Contributions are more than welcome! There are lots of opportunities
- for potential projects, so please get in touch if you would like to
- help out. Everything from code to notebooks to
- examples and documentation are all *equally valuable* so please don't feel
- you can't contribute. To contribute please `fork the project <https://github.com/lmcinnes/pynndescent/issues#fork-destination-box>`_ make your changes and
- submit a pull request. We will do our best to work through any issues with
- you and get your code merged into the main branch.
-
-
-
Keywords: nearest neighbor,knn,ANN
-Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved
-Classifier: Programming Language :: C
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
@@ -231,3 +23,204 @@ Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
+License-File: LICENSE
+
+.. image:: doc/pynndescent_logo.png
+ :width: 600
+ :align: center
+ :alt: PyNNDescent Logo
+
+.. image:: https://dev.azure.com/TutteInstitute/build-pipelines/_apis/build/status%2Flmcinnes.pynndescent?branchName=master
+ :target: https://dev.azure.com/TutteInstitute/build-pipelines/_build?definitionId=17
+ :alt: Azure Pipelines Build Status
+.. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
+ :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
+ :alt: Documentation Status
+
+===========
+PyNNDescent
+===========
+
+PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.
+It provides a python implementation of Nearest Neighbor
+Descent for k-neighbor-graph construction and approximate nearest neighbor
+search, as per the paper:
+
+Dong, Wei, Charikar Moses, and Kai Li.
+*"Efficient k-nearest neighbor graph construction for generic similarity
+measures."*
+Proceedings of the 20th international conference on World wide web. ACM, 2011.
+
+This library supplements that approach with the use of random projection trees for
+initialisation. This can be particularly useful for the metrics that are
+amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph
+diversification is also performed, pruning the longest edges of any triangles in the
+graph.
+
+Currently this library targets relatively high accuracy
+(80%-100% accuracy rate) approximate nearest neighbor searches.
+
+--------------------
+Why use PyNNDescent?
+--------------------
+
+PyNNDescent provides fast approximate nearest neighbor queries. The
+`ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
+solidly in the mix of top performing ANN libraries:
+
+**SIFT-128 Euclidean**
+
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
+ :alt: ANN benchmark performance for SIFT 128 dataset
+
+**NYTimes-256 Angular**
+
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
+ :alt: ANN benchmark performance for NYTimes 256 dataset
+
+While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
+and conda installable) with no platform or compilation issues, and is very flexible,
+supporting a wide variety of distance metrics by default:
+
+**Minkowski style metrics**
+
+- euclidean
+- manhattan
+- chebyshev
+- minkowski
+
+**Miscellaneous spatial metrics**
+
+- canberra
+- braycurtis
+- haversine
+
+**Normalized spatial metrics**
+
+- mahalanobis
+- wminkowski
+- seuclidean
+
+**Angular and correlation metrics**
+
+- cosine
+- dot
+- correlation
+- spearmanr
+- tsss
+- true_angular
+
+**Probability metrics**
+
+- hellinger
+- wasserstein
+
+**Metrics for binary data**
+
+- hamming
+- jaccard
+- dice
+- russelrao
+- kulsinski
+- rogerstanimoto
+- sokalmichener
+- sokalsneath
+- yule
+
+and also custom user defined distance metrics while still retaining performance.
+
+PyNNDescent also integrates well with Scikit-learn, including providing support
+for the KNeighborTransformer as a drop in replacement for algorithms
+that make use of nearest neighbor computations.
+
+----------------------
+How to use PyNNDescent
+----------------------
+
+PyNNDescent aims to have a very simple interface. It is similar to (but more
+limited than) KDTrees and BallTrees in ``sklearn``. In practice there are
+only two operations -- index construction, and querying an index for nearest
+neighbors.
+
+To build a new search index on some training data ``data`` you can do something
+like
+
+.. code:: python
+
+ from pynndescent import NNDescent
+ index = NNDescent(data)
+
+You can then use the index for searching (and can pickle it to disk if you
+wish). To search a pynndescent index for the 15 nearest neighbors of a test data
+set ``query_data`` you can do something like
+
+.. code:: python
+
+ index.query(query_data, k=15)
+
+and that is pretty much all there is to it. You can find more details in the
+`documentation <https://pynndescent.readthedocs.org>`_.
+
+----------
+Installing
+----------
+
+PyNNDescent is designed to be easy to install being a pure python module with
+relatively light requirements:
+
+* numpy
+* scipy
+* scikit-learn >= 0.22
+* numba >= 0.51
+
+all of which should be pip or conda installable. The easiest way to install should be
+via conda:
+
+.. code:: bash
+
+ conda install -c conda-forge pynndescent
+
+or via pip:
+
+.. code:: bash
+
+ pip install pynndescent
+
+To manually install this package:
+
+.. code:: bash
+
+ wget https://github.com/lmcinnes/pynndescent/archive/master.zip
+ unzip master.zip
+ rm master.zip
+ cd pynndescent-master
+ python setup.py install
+
+----------------
+Help and Support
+----------------
+
+This project is still young. The documentation is still growing. In the meantime please
+`open an issue <https://github.com/lmcinnes/pynndescent/issues/new>`_
+and I will try to provide any help and guidance that I can. Please also check
+the docstrings on the code, which provide some descriptions of the parameters.
+
+-------
+License
+-------
+
+The pynndescent package is 2-clause BSD licensed. Enjoy.
+
+------------
+Contributing
+------------
+
+Contributions are more than welcome! There are lots of opportunities
+for potential projects, so please get in touch if you would like to
+help out. Everything from code to notebooks to
+examples and documentation are all *equally valuable* so please don't feel
+you can't contribute. To contribute please `fork the project <https://github.com/lmcinnes/pynndescent/issues#fork-destination-box>`_ make your changes and
+submit a pull request. We will do our best to work through any issues with
+you and get your code merged into the main branch.
+
+
=====================================
pynndescent.egg-info/SOURCES.txt
=====================================
@@ -26,4 +26,6 @@ pynndescent/tests/conftest.py
pynndescent/tests/test_distances.py
pynndescent/tests/test_pynndescent_.py
pynndescent/tests/test_rank.py
-pynndescent/tests/test_data/cosine_hang.npy
\ No newline at end of file
+pynndescent/tests/test_data/cosine_hang.npy
+pynndescent/tests/test_data/cosine_near_duplicates.npy
+pynndescent/tests/test_data/pynndescent_bug_np.npz
\ No newline at end of file
=====================================
pynndescent/distances.py
=====================================
@@ -695,15 +695,10 @@ def rankdata(a, method="average"):
@numba.njit(fastmath=True)
def spearmanr(x, y):
- a = np.column_stack((x, y))
+ x_rank = rankdata(x)
+ y_rank = rankdata(y)
- n_vars = a.shape[1]
-
- for i in range(n_vars):
- a[:, i] = rankdata(a[:, i])
- rs = np.corrcoef(a, rowvar=0)
-
- return rs[1, 0]
+ return correlation(x_rank, y_rank)
@numba.njit(nogil=True)
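A note on the rewrite above: rankdata is applied to each input separately and the result is passed through pynndescent's correlation, which is a correlation *distance*, so spearmanr now returns 1 minus Spearman's rho. That is why the test further below changes to compare against ``1 - scipy_expected.correlation``. A quick equivalence check, assuming NumPy and SciPy are installed:

.. code:: python

    # Sketch (not part of the diff): the rewritten spearmanr returns a
    # distance, i.e. 1 - Spearman's rho, computed as the correlation
    # distance between the rank-transformed inputs.
    import numpy as np
    from scipy import stats
    import pynndescent.distances as pynn_dist

    x, y = np.random.random(50), np.random.random(50)
    rho = stats.spearmanr(x, y).correlation
    assert np.isclose(pynn_dist.spearmanr(x, y), 1.0 - rho, atol=1e-6)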
=====================================
pynndescent/pynndescent_.py
=====================================
@@ -93,7 +93,7 @@ def generate_leaf_updates(leaf_block, dist_thresholds, data, dist):
return updates
-@numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=True)
+@numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=False)
def init_rp_tree(data, dist, current_graph, leaf_array):
n_leaves = leaf_array.shape[0]
@@ -137,7 +137,7 @@ def init_rp_tree(data, dist, current_graph, leaf_array):
@numba.njit(
fastmath=True,
locals={"d": numba.float32, "idx": numba.int32, "i": numba.int32},
- cache=True,
+ cache=False,
)
def init_random(n_neighbors, data, heap, dist, rng_state):
for i in range(data.shape[0]):
@@ -199,7 +199,7 @@ def generate_graph_updates(
return updates
-@numba.njit(cache=True)
+@numba.njit(cache=False)
def process_candidates(
data,
dist,
@@ -630,6 +630,13 @@ class NNDescent:
non-negligible computation cost in building the index. Don't tweak
this value unless you know what you're doing.
+ max_rptree_depth: int (optional, default=100)
+ Maximum depth of random projection trees. Increasing this may result in a
+ richer, deeper random projection forest, but it may be composed of many
+ degenerate branches. Increase leaf_size in order to keep shallower, wider
+ nondegenerate trees. Such wide trees, however, may yield poor performance
+ of the preparation of the NN descent.
+
n_iters: int (optional, default=None)
The maximum number of NN-descent iterations to perform. The
NN-descent algorithm can abort early if limited progress is being
@@ -679,6 +686,7 @@ class NNDescent:
random_state=None,
low_memory=True,
max_candidates=None,
+ max_rptree_depth=200,
n_iters=None,
delta=0.001,
n_jobs=None,
@@ -702,6 +710,7 @@ class NNDescent:
self.prune_degree_multiplier = pruning_degree_multiplier
self.diversify_prob = diversify_prob
self.n_search_trees = n_search_trees
+ self.max_rptree_depth = max_rptree_depth
self.max_candidates = max_candidates
self.low_memory = low_memory
self.n_iters = n_iters
@@ -776,6 +785,7 @@ class NNDescent:
if metric == "dot":
data = normalize(data, norm="l2", copy=copy_on_normalize)
+ self._raw_data = data
self.rng_state = current_random_state.randint(INT32_MIN, INT32_MAX, 3).astype(
np.int64
@@ -799,6 +809,7 @@ class NNDescent:
current_random_state,
self.n_jobs,
self._angular_trees,
+ max_depth=self.max_rptree_depth,
)
leaf_array = rptree_leaf_array(self._rp_forest)
else:
@@ -988,6 +999,7 @@ class NNDescent:
current_random_state,
self.n_jobs,
self._angular_trees,
+ max_depth=self.max_rptree_depth,
)
self._search_forest = [
convert_tree_format(
@@ -1838,6 +1850,7 @@ class NNDescent:
current_random_state,
self.n_jobs,
self._angular_trees,
+ max_depth=self.max_rptree_depth,
)
leaf_array = rptree_leaf_array(self._rp_forest)
current_graph = make_heap(self._raw_data.shape[0], self.n_neighbors)
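The new max_rptree_depth parameter threads from the constructor into every make_forest call in this file; note that the added docstring states default=100 while the signature actually defaults to 200, an upstream inconsistency worth flagging. A minimal usage sketch:

.. code:: python

    # Minimal sketch of the new constructor parameter: cap the random
    # projection tree depth below the default of 200.
    import numpy as np
    from pynndescent import NNDescent

    data = np.random.random((1000, 32)).astype(np.float32)
    index = NNDescent(data, n_neighbors=10, max_rptree_depth=100)
    neighbor_indices, neighbor_dists = index.neighbor_graph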
=====================================
pynndescent/rp_trees.py
=====================================
@@ -141,6 +141,18 @@ def angular_random_projection_split(data, indices, rng_state):
side[i] = 1
n_right += 1
+ # If all points end up on one side, something went wrong numerically
+ # In this case, assign points randomly; they are likely very close anyway
+ if n_left == 0 or n_right == 0:
+ n_left = 0
+ n_right = 0
+ for i in range(indices.shape[0]):
+ side[i] = tau_rand_int(rng_state) % 2
+ if side[i] == 0:
+ n_left += 1
+ else:
+ n_right += 1
+
# Now that we have the counts allocate arrays
indices_left = np.empty(n_left, dtype=np.int32)
indices_right = np.empty(n_right, dtype=np.int32)
@@ -248,6 +260,18 @@ def euclidean_random_projection_split(data, indices, rng_state):
side[i] = 1
n_right += 1
+ # If all points end up on one side, something went wrong numerically
+ # In this case, assign points randomly; they are likely very close anyway
+ if n_left == 0 or n_right == 0:
+ n_left = 0
+ n_right = 0
+ for i in range(indices.shape[0]):
+ side[i] = tau_rand_int(rng_state) % 2
+ if side[i] == 0:
+ n_left += 1
+ else:
+ n_right += 1
+
# Now that we have the counts allocate arrays
indices_left = np.empty(n_left, dtype=np.int32)
indices_right = np.empty(n_right, dtype=np.int32)
@@ -372,6 +396,18 @@ def sparse_angular_random_projection_split(inds, indptr, data, indices, rng_stat
side[i] = 1
n_right += 1
+ # If all points end up on one side, something went wrong numerically
+ # In this case, assign points randomly; they are likely very close anyway
+ if n_left == 0 or n_right == 0:
+ n_left = 0
+ n_right = 0
+ for i in range(indices.shape[0]):
+ side[i] = tau_rand_int(rng_state) % 2
+ if side[i] == 0:
+ n_left += 1
+ else:
+ n_right += 1
+
# Now that we have the counts allocate arrays
indices_left = np.empty(n_left, dtype=np.int32)
indices_right = np.empty(n_right, dtype=np.int32)
@@ -479,6 +515,18 @@ def sparse_euclidean_random_projection_split(inds, indptr, data, indices, rng_st
side[i] = 1
n_right += 1
+ # If all points end up on one side, something went wrong numerically
+ # In this case, assign points randomly; they are likely very close anyway
+ if n_left == 0 or n_right == 0:
+ n_left = 0
+ n_right = 0
+ for i in range(indices.shape[0]):
+ side[i] = abs(tau_rand_int(rng_state)) % 2
+ if side[i] == 0:
+ n_left += 1
+ else:
+ n_right += 1
+
# Now that we have the counts allocate arrays
indices_left = np.empty(n_left, dtype=np.int32)
indices_right = np.empty(n_right, dtype=np.int32)
@@ -501,7 +549,6 @@ def sparse_euclidean_random_projection_split(inds, indptr, data, indices, rng_st
@numba.njit(
nogil=True,
- cache=True,
locals={"left_node_num": numba.types.int32, "right_node_num": numba.types.int32},
)
def make_euclidean_tree(
@@ -513,8 +560,9 @@ def make_euclidean_tree(
point_indices,
rng_state,
leaf_size=30,
+ max_depth=200,
):
- if indices.shape[0] > leaf_size:
+ if indices.shape[0] > leaf_size and max_depth > 0:
(
left_indices,
right_indices,
@@ -531,6 +579,7 @@ def make_euclidean_tree(
point_indices,
rng_state,
leaf_size,
+ max_depth - 1,
)
left_node_num = len(point_indices) - 1
@@ -544,6 +593,7 @@ def make_euclidean_tree(
point_indices,
rng_state,
leaf_size,
+ max_depth - 1,
)
right_node_num = len(point_indices) - 1
@@ -563,7 +613,6 @@ def make_euclidean_tree(
@numba.njit(
nogil=True,
- cache=True,
locals={
"children": numba.types.ListType(children_type),
"left_node_num": numba.types.int32,
@@ -579,8 +628,9 @@ def make_angular_tree(
point_indices,
rng_state,
leaf_size=30,
+ max_depth=200,
):
- if indices.shape[0] > leaf_size:
+ if indices.shape[0] > leaf_size and max_depth > 0:
(
left_indices,
right_indices,
@@ -597,6 +647,7 @@ def make_angular_tree(
point_indices,
rng_state,
leaf_size,
+ max_depth - 1,
)
left_node_num = len(point_indices) - 1
@@ -610,6 +661,7 @@ def make_angular_tree(
point_indices,
rng_state,
leaf_size,
+ max_depth - 1,
)
right_node_num = len(point_indices) - 1
@@ -629,7 +681,6 @@ def make_angular_tree(
@numba.njit(
nogil=True,
- cache=True,
locals={"left_node_num": numba.types.int32, "right_node_num": numba.types.int32},
)
def make_sparse_euclidean_tree(
@@ -643,8 +694,9 @@ def make_sparse_euclidean_tree(
point_indices,
rng_state,
leaf_size=30,
+ max_depth=200,
):
- if indices.shape[0] > leaf_size:
+ if indices.shape[0] > leaf_size and max_depth > 0:
(
left_indices,
right_indices,
@@ -665,6 +717,7 @@ def make_sparse_euclidean_tree(
point_indices,
rng_state,
leaf_size,
+ max_depth - 1,
)
left_node_num = len(point_indices) - 1
@@ -680,6 +733,7 @@ def make_sparse_euclidean_tree(
point_indices,
rng_state,
leaf_size,
+ max_depth - 1,
)
right_node_num = len(point_indices) - 1
@@ -699,7 +753,6 @@ def make_sparse_euclidean_tree(
@numba.njit(
nogil=True,
- cache=True,
locals={"left_node_num": numba.types.int32, "right_node_num": numba.types.int32},
)
def make_sparse_angular_tree(
@@ -713,8 +766,9 @@ def make_sparse_angular_tree(
point_indices,
rng_state,
leaf_size=30,
+ max_depth=200,
):
- if indices.shape[0] > leaf_size:
+ if indices.shape[0] > leaf_size and max_depth > 0:
(
left_indices,
right_indices,
@@ -735,6 +789,7 @@ def make_sparse_angular_tree(
point_indices,
rng_state,
leaf_size,
+ max_depth - 1,
)
left_node_num = len(point_indices) - 1
@@ -750,6 +805,7 @@ def make_sparse_angular_tree(
point_indices,
rng_state,
leaf_size,
+ max_depth - 1,
)
right_node_num = len(point_indices) - 1
@@ -765,8 +821,8 @@ def make_sparse_angular_tree(
point_indices.append(indices)
-@numba.njit(nogil=True, cache=True)
-def make_dense_tree(data, rng_state, leaf_size=30, angular=False):
+@numba.njit(nogil=True)
+def make_dense_tree(data, rng_state, leaf_size=30, angular=False, max_depth=200):
indices = np.arange(data.shape[0]).astype(np.int32)
hyperplanes = numba.typed.List.empty_list(dense_hyperplane_type)
@@ -784,6 +840,7 @@ def make_dense_tree(data, rng_state, leaf_size=30, angular=False):
point_indices,
rng_state,
leaf_size,
+ max_depth=max_depth,
)
else:
make_euclidean_tree(
@@ -795,14 +852,28 @@ def make_dense_tree(data, rng_state, leaf_size=30, angular=False):
point_indices,
rng_state,
leaf_size,
+ max_depth=max_depth,
)
- result = FlatTree(hyperplanes, offsets, children, point_indices, leaf_size)
+ max_leaf_size = leaf_size
+ for points in point_indices:
+ if len(points) > max_leaf_size:
+ max_leaf_size = numba.int32(len(points))
+
+ result = FlatTree(hyperplanes, offsets, children, point_indices, max_leaf_size)
return result
-@numba.njit(nogil=True, cache=True)
-def make_sparse_tree(inds, indptr, spdata, rng_state, leaf_size=30, angular=False):
+@numba.njit(nogil=True)
+def make_sparse_tree(
+ inds,
+ indptr,
+ spdata,
+ rng_state,
+ leaf_size=30,
+ angular=False,
+ max_depth=200,
+):
indices = np.arange(indptr.shape[0] - 1).astype(np.int32)
hyperplanes = numba.typed.List.empty_list(sparse_hyperplane_type)
@@ -822,6 +893,7 @@ def make_sparse_tree(inds, indptr, spdata, rng_state, leaf_size=30, angular=Fals
point_indices,
rng_state,
leaf_size,
+ max_depth=max_depth,
)
else:
make_sparse_euclidean_tree(
@@ -835,9 +907,15 @@ def make_sparse_tree(inds, indptr, spdata, rng_state, leaf_size=30, angular=Fals
point_indices,
rng_state,
leaf_size,
+ max_depth=max_depth,
)
- return FlatTree(hyperplanes, offsets, children, point_indices, leaf_size)
+ max_leaf_size = leaf_size
+ for points in point_indices:
+ if len(points) > max_leaf_size:
+ max_leaf_size = numba.int32(len(points))
+
+ return FlatTree(hyperplanes, offsets, children, point_indices, max_leaf_size)
@numba.njit(
@@ -956,6 +1034,7 @@ def make_forest(
random_state,
n_jobs=None,
angular=False,
+ max_depth=200,
):
"""Build a random projection forest with ``n_trees``.
@@ -993,12 +1072,19 @@ def make_forest(
rng_states[i],
leaf_size,
angular,
+ max_depth=max_depth,
)
for i in range(n_trees)
)
else:
result = joblib.Parallel(n_jobs=n_jobs, require="sharedmem")(
- joblib.delayed(make_dense_tree)(data, rng_states[i], leaf_size, angular)
+ joblib.delayed(make_dense_tree)(
+ data,
+ rng_states[i],
+ leaf_size,
+ angular,
+ max_depth=max_depth
+ )
for i in range(n_trees)
)
except (RuntimeError, RecursionError, SystemError):
@@ -1011,14 +1097,14 @@ def make_forest(
return tuple(result)
-@numba.njit(nogil=True, cache=True)
-def get_leaves_from_tree(tree):
+@numba.njit(nogil=True)
+def get_leaves_from_tree(tree, max_leaf_size):
n_leaves = 0
for i in range(len(tree.children)):
if tree.children[i][0] == -1 and tree.children[i][1] == -1:
n_leaves += 1
- result = np.full((n_leaves, tree.leaf_size), -1, dtype=np.int32)
+ result = np.full((n_leaves, max_leaf_size), -1, dtype=np.int32)
leaf_index = 0
for i in range(len(tree.indices)):
if tree.children[i][0] == -1 or tree.children[i][1] == -1:
@@ -1030,8 +1116,9 @@ def get_leaves_from_tree(tree):
def rptree_leaf_array_parallel(rp_forest):
+ max_leaf_size = np.max([rp_tree.leaf_size for rp_tree in rp_forest])
result = joblib.Parallel(n_jobs=-1, require="sharedmem")(
- joblib.delayed(get_leaves_from_tree)(rp_tree) for rp_tree in rp_forest
+ joblib.delayed(get_leaves_from_tree)(rp_tree, max_leaf_size) for rp_tree in rp_forest
)
return result
@@ -1047,7 +1134,6 @@ def rptree_leaf_array(rp_forest):
def recursive_convert(
tree, hyperplanes, offsets, children, indices, node_num, leaf_start, tree_node
):
-
if tree.children[tree_node][0] < 0:
leaf_end = leaf_start + len(tree.indices[tree_node])
children[node_num, 0] = -leaf_start
@@ -1087,7 +1173,6 @@ def recursive_convert(
def recursive_convert_sparse(
tree, hyperplanes, offsets, children, indices, node_num, leaf_start, tree_node
):
-
if tree.children[tree_node][0] < 0:
leaf_end = leaf_start + len(tree.indices[tree_node])
children[node_num, 0] = -leaf_start
@@ -1176,7 +1261,6 @@ FLAT_TREE_LEAF_SIZE = 4
def denumbaify_tree(tree):
-
result = (
tree.hyperplanes,
tree.offsets,
@@ -1189,7 +1273,6 @@ def denumbaify_tree(tree):
def renumbaify_tree(tree):
-
result = FlatTree(
tree[FLAT_TREE_HYPERPLANES],
tree[FLAT_TREE_OFFSETS],
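Two recurring changes in this file deserve a gloss: every split routine gains the same fallback that assigns points to sides at random when a numerically degenerate hyperplane puts all of them on one side, and the built trees now record the largest leaf actually produced rather than assuming leaf_size, since a depth-capped or degenerate branch can leave an oversized leaf. A standalone illustration of the fallback, using a hypothetical helper name:

.. code:: python

    # Standalone illustration (fallback_split is a hypothetical name, not
    # from the diff): on a degenerate split, points are reassigned to the
    # two sides at random, so near-duplicate points land in leaves of
    # bounded size instead of recursing indefinitely.
    import numpy as np

    def fallback_split(n_points, rng):
        side = rng.integers(0, 2, size=n_points)  # random 0/1 per point
        return np.flatnonzero(side == 0), np.flatnonzero(side == 1)

    left, right = fallback_split(8, np.random.default_rng(42))
    assert len(left) + len(right) == 8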
=====================================
pynndescent/sparse_nndescent.py
=====================================
@@ -53,7 +53,7 @@ def generate_leaf_updates(leaf_block, dist_thresholds, inds, indptr, data, dist)
return updates
-@numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=True)
+@numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=False)
def init_rp_tree(inds, indptr, data, dist, current_graph, leaf_array):
n_leaves = leaf_array.shape[0]
@@ -99,7 +99,7 @@ def init_rp_tree(inds, indptr, data, dist, current_graph, leaf_array):
@numba.njit(
fastmath=True,
locals={"d": numba.float32, "i": numba.int32, "idx": numba.int32},
- cache=True,
+ cache=False,
)
def init_random(n_neighbors, inds, indptr, data, heap, dist, rng_state):
n_samples = indptr.shape[0] - 1
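The cache=True to cache=False flips here, in pynndescent_.py above, and in utils.py below disable numba's on-disk compilation cache for these functions; a plausible motivation, stated only as an assumption, is that cached artifacts are brittle across numba and environment changes such as the Debian build at hand. A toy sketch of the flag itself:

.. code:: python

    # Toy sketch of the flag being flipped: with cache=False the function
    # is JIT-compiled in each fresh process instead of being loaded from
    # numba's on-disk cache.
    import numba

    @numba.njit(cache=False)
    def add_one(x):
        return x + 1

    add_one(1)  # first call triggers compilation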
=====================================
pynndescent/tests/conftest.py
=====================================
@@ -63,6 +63,13 @@ def cosine_hang_data():
return np.load(data_path)
+@pytest.fixture
+def cosine_near_duplicates_data():
+ this_dir = os.path.dirname(os.path.abspath(__file__))
+ data_path = os.path.join(this_dir, "test_data/cosine_near_duplicates.npy")
+ return np.load(data_path)
+
+
@pytest.fixture
def small_data():
return np.random.uniform(40, 5, size=(20, 5))
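The new fixture follows the existing cosine_hang_data pattern: pytest injects it into any test that names it as a parameter, as the regression test added in test_pynndescent_.py below does. A minimal consumer sketch with a hypothetical test name:

.. code:: python

    # Minimal sketch of consuming the new fixture; pytest resolves the
    # parameter name against conftest.py. The 2-D shape check is an
    # illustrative assumption about the data file.
    def test_near_duplicates_load(cosine_near_duplicates_data):
        assert cosine_near_duplicates_data.ndim == 2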
=====================================
pynndescent/tests/test_data/cosine_near_duplicates.npy
=====================================
Binary files /dev/null and b/pynndescent/tests/test_data/cosine_near_duplicates.npy differ
=====================================
pynndescent/tests/test_data/pynndescent_bug_np.npz
=====================================
Binary files /dev/null and b/pynndescent/tests/test_data/pynndescent_bug_np.npz differ
=====================================
pynndescent/tests/test_distances.py
=====================================
@@ -58,7 +58,6 @@ def test_spatial_check(spatial_data, metric):
"jaccard",
"matching",
"dice",
- "kulsinski",
"rogerstanimoto",
"russellrao",
"sokalmichener",
@@ -70,7 +69,7 @@ def test_binary_check(binary_data, metric):
dist_matrix = pairwise_distances(binary_data, metric=metric)
if metric in ("jaccard", "dice", "sokalsneath", "yule"):
dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
- if metric in ("kulsinski", "russellrao"):
+ if metric == "russellrao":
dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
# And because distance between all zero vectors should be zero
dist_matrix[10, 11] = 0.0
@@ -109,11 +108,11 @@ def test_binary_check(binary_data, metric):
def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
if metric in spdist.sparse_named_distances:
dist_matrix = pairwise_distances(
- sparse_spatial_data.todense().astype(np.float32), metric=metric
+ np.asarray(sparse_spatial_data.todense()).astype(np.float32), metric=metric
)
if metric in ("braycurtis", "dice", "sokalsneath", "yule"):
dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
- if metric in ("cosine", "correlation", "kulsinski", "russellrao"):
+ if metric in ("cosine", "correlation", "russellrao"):
dist_matrix[np.where(~np.isfinite(dist_matrix))] = 1.0
# And because distance between all zero vectors should be zero
dist_matrix[10, 11] = 0.0
@@ -165,7 +164,6 @@ def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
"jaccard",
"matching",
"dice",
- "kulsinski",
"rogerstanimoto",
"russellrao",
"sokalmichener",
@@ -174,10 +172,10 @@ def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
)
def test_sparse_binary_check(sparse_binary_data, metric):
if metric in spdist.sparse_named_distances:
- dist_matrix = pairwise_distances(sparse_binary_data.todense(), metric=metric)
+ dist_matrix = pairwise_distances(np.asarray(sparse_binary_data.todense()), metric=metric)
if metric in ("jaccard", "dice", "sokalsneath"):
dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
- if metric in ("kulsinski", "russellrao"):
+ if metric == "russellrao":
dist_matrix[np.where(~np.isfinite(dist_matrix))] = 1.0
# And because distance between all zero vectors should be zero
dist_matrix[10, 11] = 0.0
@@ -309,7 +307,7 @@ def test_spearmanr():
scipy_expected = stats.spearmanr(x, y)
r = dist.spearmanr(x, y)
- assert_array_almost_equal(r, scipy_expected.correlation)
+ assert_array_almost_equal(r, 1 - scipy_expected.correlation)
def test_alternative_distances():
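The kulsinski removals throughout this file track SciPy, which deprecated the metric in 1.9 and removed it in 1.11, so pairwise_distances can no longer produce reference values for it. A defensive availability check under that assumption:

.. code:: python

    # Hedged check: kulsinski is absent from modern SciPy (removed in
    # 1.11), which is why the parametrised metric lists above drop it.
    import scipy.spatial.distance as ssd

    if not hasattr(ssd, "kulsinski"):
        print("kulsinski not available in this SciPy version")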
=====================================
pynndescent/tests/test_pynndescent_.py
=====================================
@@ -287,6 +287,24 @@ def test_deduplicated_data_behaves_normally(seed, cosine_hang_data):
), "NN-descent did not get 95% accuracy on nearest neighbors"
+def test_rp_trees_should_not_stack_overflow_with_near_duplicate_data(seed, cosine_near_duplicates_data):
+
+ n_neighbors = 10
+ knn_indices, _ = NNDescent(
+ cosine_near_duplicates_data,
+ "cosine",
+ {},
+ n_neighbors,
+ random_state=np.random.RandomState(seed),
+ n_trees=20,
+ )._neighbor_graph
+
+ for i in range(cosine_near_duplicates_data.shape[0]):
+ assert len(knn_indices[i]) == len(
+ np.unique(knn_indices[i])
+ ), "Duplicate graph_indices in knn graph"
+
+
def test_output_when_verbose_is_true(spatial_data, seed):
out = io.StringIO()
with redirect_stdout(out):
@@ -663,3 +681,8 @@ def test_tree_no_split(small_data, sparse_small_data, metric):
), "NN-descent query did not get 95% for accuracy on nearest neighbors on {} data".format(
data_type
)
+
+@pytest.mark.skipif('NUMBA_DISABLE_JIT' in os.environ, reason="Too expensive for disabled Numba")
+def test_bad_data():
+ data = np.sqrt(np.load("pynndescent/tests/test_data/pynndescent_bug_np.npz")['arr_0'])
+ index = NNDescent(data, metric="cosine")
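Both new tests exercise the rp-tree hardening from rp_trees.py: near-duplicate rows used to drive splits into unbounded recursion. A hedged sketch of data in that spirit (synthetic, since the shipped .npy/.npz files are binary):

.. code:: python

    # Synthetic near-duplicate data in the spirit of the regression tests
    # above; with the depth cap and random fallback split this indexes
    # without exhausting the recursion depth.
    import numpy as np
    from pynndescent import NNDescent

    base = np.random.random((1, 16)).astype(np.float32)
    jitter = 1e-7 * np.random.random((500, 16)).astype(np.float32)
    index = NNDescent(base + jitter, metric="cosine", n_neighbors=5)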
=====================================
pynndescent/utils.py
=====================================
@@ -676,7 +676,7 @@ def apply_graph_updates_high_memory(current_graph, updates, in_graph):
return n_changes
-@numba.njit(cache=True)
+@numba.njit(cache=False)
def initalize_heap_from_graph_indices(heap, graph_indices, data, metric):
for i in range(graph_indices.shape[0]):
=====================================
setup.py
=====================================
@@ -8,7 +8,7 @@ def readme():
configuration = {
"name": "pynndescent",
- "version": "0.5.8",
+ "version": "0.5.11",
"description": "Nearest Neighbor Descent",
"long_description": readme(),
"classifiers": [
@@ -16,7 +16,6 @@ configuration = {
"Intended Audience :: Science/Research",
"Intended Audience :: Developers",
"License :: OSI Approved",
- "Programming Language :: C",
"Programming Language :: Python",
"Topic :: Software Development",
"Topic :: Scientific/Engineering",
View it on GitLab: https://salsa.debian.org/python-team/packages/python-pynndescent/-/compare/d6ce6292c883607cd6aebe8537436f5708106c48...b972a7f420e3ebcc0186eabd61dd5b4461183b63