[med-svn] [Git][python-team/packages/python-pynndescent][master] 6 commits: routine-update: New upstream version

Andreas Tille (@tille) gitlab at salsa.debian.org
Thu Dec 14 10:41:43 GMT 2023



Andreas Tille pushed to branch master at Debian Python Team / packages / python-pynndescent


Commits:
e5d9909b by Andreas Tille at 2023-12-14T11:10:56+01:00
routine-update: New upstream version

- - - - -
27116056 by Andreas Tille at 2023-12-14T11:10:57+01:00
New upstream version 0.5.11
- - - - -
fbb63112 by Andreas Tille at 2023-12-14T11:10:58+01:00
Update upstream source from tag 'upstream/0.5.11'

Update to upstream version '0.5.11'
with Debian dir 579d7377d8aabbd8e41308e87f5e81099845f18f
- - - - -
1478e110 by Andreas Tille at 2023-12-14T11:11:02+01:00
routine-update: Build-Depends: s/dh-python/dh-sequence-python3/

- - - - -
8d8c0161 by Andreas Tille at 2023-12-14T11:16:43+01:00
Update patches

- - - - -
b972a7f4 by Andreas Tille at 2023-12-14T11:20:14+01:00
TODO: bug #1057598  FTBFS: ImportError: Numba could not be imported.

- - - - -


21 changed files:

- PKG-INFO
- README.rst
- debian/changelog
- debian/control
- debian/patches/arm.patch
- − debian/patches/scipy.patch
- debian/patches/series
- debian/rules
- pynndescent.egg-info/PKG-INFO
- pynndescent.egg-info/SOURCES.txt
- pynndescent/distances.py
- pynndescent/pynndescent_.py
- pynndescent/rp_trees.py
- pynndescent/sparse_nndescent.py
- pynndescent/tests/conftest.py
- + pynndescent/tests/test_data/cosine_near_duplicates.npy
- + pynndescent/tests/test_data/pynndescent_bug_np.npz
- pynndescent/tests/test_distances.py
- pynndescent/tests/test_pynndescent_.py
- pynndescent/utils.py
- setup.py


Changes:

=====================================
PKG-INFO
=====================================
@@ -1,6 +1,6 @@
-Metadata-Version: 1.2
+Metadata-Version: 2.1
 Name: pynndescent
-Version: 0.5.8
+Version: 0.5.11
 Summary: Nearest Neighbor Descent
 Home-page: http://github.com/lmcinnes/pynndescent
 Author: Leland McInnes
@@ -8,219 +8,11 @@ Author-email: leland.mcinnes at gmail.com
 Maintainer: Leland McInnes
 Maintainer-email: leland.mcinnes at gmail.com
 License: BSD
-Description: .. image:: doc/pynndescent_logo.png
-          :width: 600
-          :align: center
-          :alt: PyNNDescent Logo
-        
-        .. image:: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_apis/build/status/lmcinnes.pynndescent?branchName=master
-            :target: .. _build_status: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_build/latest?definitionId=2&branchName=master
-            :alt: Azure Pipelines Build Status
-        .. image:: https://img.shields.io/lgtm/alerts/g/lmcinnes/pynndescent.svg
-            :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/alerts
-            :alt: LGTM Alerts
-        .. image:: https://img.shields.io/lgtm/grade/python/g/lmcinnes/pynndescent.svg
-            :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/context:python
-            :alt: LGTM Grade
-        .. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
-            :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
-            :alt: Documentation Status
-        
-        ===========
-        PyNNDescent
-        ===========
-        
-        PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.
-        It provides a python implementation of Nearest Neighbor
-        Descent for k-neighbor-graph construction and approximate nearest neighbor
-        search, as per the paper:
-        
-        Dong, Wei, Charikar Moses, and Kai Li.
-        *"Efficient k-nearest neighbor graph construction for generic similarity
-        measures."*
-        Proceedings of the 20th international conference on World wide web. ACM, 2011.
-        
-        This library supplements that approach with the use of random projection trees for
-        initialisation. This can be particularly useful for the metrics that are
-        amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph
-        diversification is also performed, pruning the longest edges of any triangles in the
-        graph.
-        
-        Currently this library targets relatively high accuracy 
-        (80%-100% accuracy rate) approximate nearest neighbor searches.
-        
-        --------------------
-        Why use PyNNDescent?
-        --------------------
-        
-        PyNNDescent provides fast approximate nearest neighbor queries. The
-        `ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
-        solidly in the mix of top performing ANN libraries:
-        
-        **SIFT-128 Euclidean**
-        
-        .. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
-            :alt: ANN benchmark performance for SIFT 128 dataset
-        
-        **NYTimes-256 Angular**
-        
-        .. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
-            :alt: ANN benchmark performance for NYTimes 256 dataset
-        
-        While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
-        and conda installable) with no platform or compilation issues, and is very flexible,
-        supporting a wide variety of distance metrics by default:
-        
-        **Minkowski style metrics**
-        
-        - euclidean
-        - manhattan
-        - chebyshev
-        - minkowski
-        
-        **Miscellaneous spatial metrics**
-        
-        - canberra
-        - braycurtis
-        - haversine
-        
-        **Normalized spatial metrics**
-        
-        - mahalanobis
-        - wminkowski
-        - seuclidean
-        
-        **Angular and correlation metrics**
-        
-        - cosine
-        - dot
-        - correlation
-        - spearmanr
-        - tsss
-        - true_angular
-        
-        **Probability metrics**
-        
-        - hellinger
-        - wasserstein
-        
-        **Metrics for binary data**
-        
-        - hamming
-        - jaccard
-        - dice
-        - russelrao
-        - kulsinski
-        - rogerstanimoto
-        - sokalmichener
-        - sokalsneath
-        - yule
-        
-        and also custom user defined distance metrics while still retaining performance.
-        
-        PyNNDescent also integrates well with Scikit-learn, including providing support
-        for the KNeighborTransformer as a drop in replacement for algorithms
-        that make use of nearest neighbor computations.
-        
-        ----------------------
-        How to use PyNNDescent
-        ----------------------
-        
-        PyNNDescent aims to have a very simple interface. It is similar to (but more
-        limited than) KDTrees and BallTrees in ``sklearn``. In practice there are
-        only two operations -- index construction, and querying an index for nearest
-        neighbors.
-        
-        To build a new search index on some training data ``data`` you can do something
-        like
-        
-        .. code:: python
-        
-            from pynndescent import NNDescent
-            index = NNDescent(data)
-        
-        You can then use the index for searching (and can pickle it to disk if you
-        wish). To search a pynndescent index for the 15 nearest neighbors of a test data
-        set ``query_data`` you can do something like
-        
-        .. code:: python
-        
-            index.query(query_data, k=15)
-        
-        and that is pretty much all there is to it. You can find more details in the
-        `documentation <https://pynndescent.readthedocs.org>`_.
-        
-        ----------
-        Installing
-        ----------
-        
-        PyNNDescent is designed to be easy to install being a pure python module with
-        relatively light requirements:
-        
-        * numpy
-        * scipy
-        * scikit-learn >= 0.22
-        * numba >= 0.51
-        
-        all of which should be pip or conda installable. The easiest way to install should be
-        via conda:
-        
-        .. code:: bash
-        
-            conda install -c conda-forge pynndescent
-        
-        or via pip:
-        
-        .. code:: bash
-        
-            pip install pynndescent
-        
-        To manually install this package:
-        
-        .. code:: bash
-        
-            wget https://github.com/lmcinnes/pynndescent/archive/master.zip
-            unzip master.zip
-            rm master.zip
-            cd pynndescent-master
-            python setup.py install
-        
-        ----------------
-        Help and Support
-        ----------------
-        
-        This project is still young. The documentation is still growing. In the meantime please
-        `open an issue <https://github.com/lmcinnes/pynndescent/issues/new>`_
-        and I will try to provide any help and guidance that I can. Please also check
-        the docstrings on the code, which provide some descriptions of the parameters.
-        
-        -------
-        License
-        -------
-        
-        The pynndescent package is 2-clause BSD licensed. Enjoy.
-        
-        ------------
-        Contributing
-        ------------
-        
-        Contributions are more than welcome! There are lots of opportunities
-        for potential projects, so please get in touch if you would like to
-        help out. Everything from code to notebooks to
-        examples and documentation are all *equally valuable* so please don't feel
-        you can't contribute. To contribute please `fork the project <https://github.com/lmcinnes/pynndescent/issues#fork-destination-box>`_ make your changes and
-        submit a pull request. We will do our best to work through any issues with
-        you and get your code merged into the main branch.
-        
-        
-        
 Keywords: nearest neighbor,knn,ANN
-Platform: UNKNOWN
 Classifier: Development Status :: 3 - Alpha
 Classifier: Intended Audience :: Science/Research
 Classifier: Intended Audience :: Developers
 Classifier: License :: OSI Approved
-Classifier: Programming Language :: C
 Classifier: Programming Language :: Python
 Classifier: Topic :: Software Development
 Classifier: Topic :: Scientific/Engineering
@@ -231,3 +23,204 @@ Classifier: Operating System :: MacOS
 Classifier: Programming Language :: Python :: 3.6
 Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: 3.8
+License-File: LICENSE
+
+.. image:: doc/pynndescent_logo.png
+  :width: 600
+  :align: center
+  :alt: PyNNDescent Logo
+
+.. image:: https://dev.azure.com/TutteInstitute/build-pipelines/_apis/build/status%2Flmcinnes.pynndescent?branchName=master
+    :target: https://dev.azure.com/TutteInstitute/build-pipelines/_build?definitionId=17
+    :alt: Azure Pipelines Build Status
+.. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
+    :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
+    :alt: Documentation Status
+
+===========
+PyNNDescent
+===========
+
+PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.
+It provides a python implementation of Nearest Neighbor
+Descent for k-neighbor-graph construction and approximate nearest neighbor
+search, as per the paper:
+
+Dong, Wei, Charikar Moses, and Kai Li.
+*"Efficient k-nearest neighbor graph construction for generic similarity
+measures."*
+Proceedings of the 20th international conference on World wide web. ACM, 2011.
+
+This library supplements that approach with the use of random projection trees for
+initialisation. This can be particularly useful for the metrics that are
+amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph
+diversification is also performed, pruning the longest edges of any triangles in the
+graph.
+
+Currently this library targets relatively high accuracy 
+(80%-100% accuracy rate) approximate nearest neighbor searches.
+
+--------------------
+Why use PyNNDescent?
+--------------------
+
+PyNNDescent provides fast approximate nearest neighbor queries. The
+`ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
+solidly in the mix of top performing ANN libraries:
+
+**SIFT-128 Euclidean**
+
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
+    :alt: ANN benchmark performance for SIFT 128 dataset
+
+**NYTimes-256 Angular**
+
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
+    :alt: ANN benchmark performance for NYTimes 256 dataset
+
+While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
+and conda installable) with no platform or compilation issues, and is very flexible,
+supporting a wide variety of distance metrics by default:
+
+**Minkowski style metrics**
+
+- euclidean
+- manhattan
+- chebyshev
+- minkowski
+
+**Miscellaneous spatial metrics**
+
+- canberra
+- braycurtis
+- haversine
+
+**Normalized spatial metrics**
+
+- mahalanobis
+- wminkowski
+- seuclidean
+
+**Angular and correlation metrics**
+
+- cosine
+- dot
+- correlation
+- spearmanr
+- tsss
+- true_angular
+
+**Probability metrics**
+
+- hellinger
+- wasserstein
+
+**Metrics for binary data**
+
+- hamming
+- jaccard
+- dice
+- russelrao
+- kulsinski
+- rogerstanimoto
+- sokalmichener
+- sokalsneath
+- yule
+
+and also custom user defined distance metrics while still retaining performance.
+
+PyNNDescent also integrates well with Scikit-learn, including providing support
+for the KNeighborTransformer as a drop in replacement for algorithms
+that make use of nearest neighbor computations.
+
+----------------------
+How to use PyNNDescent
+----------------------
+
+PyNNDescent aims to have a very simple interface. It is similar to (but more
+limited than) KDTrees and BallTrees in ``sklearn``. In practice there are
+only two operations -- index construction, and querying an index for nearest
+neighbors.
+
+To build a new search index on some training data ``data`` you can do something
+like
+
+.. code:: python
+
+    from pynndescent import NNDescent
+    index = NNDescent(data)
+
+You can then use the index for searching (and can pickle it to disk if you
+wish). To search a pynndescent index for the 15 nearest neighbors of a test data
+set ``query_data`` you can do something like
+
+.. code:: python
+
+    index.query(query_data, k=15)
+
+and that is pretty much all there is to it. You can find more details in the
+`documentation <https://pynndescent.readthedocs.org>`_.
+
+----------
+Installing
+----------
+
+PyNNDescent is designed to be easy to install being a pure python module with
+relatively light requirements:
+
+* numpy
+* scipy
+* scikit-learn >= 0.22
+* numba >= 0.51
+
+all of which should be pip or conda installable. The easiest way to install should be
+via conda:
+
+.. code:: bash
+
+    conda install -c conda-forge pynndescent
+
+or via pip:
+
+.. code:: bash
+
+    pip install pynndescent
+
+To manually install this package:
+
+.. code:: bash
+
+    wget https://github.com/lmcinnes/pynndescent/archive/master.zip
+    unzip master.zip
+    rm master.zip
+    cd pynndescent-master
+    python setup.py install
+
+----------------
+Help and Support
+----------------
+
+This project is still young. The documentation is still growing. In the meantime please
+`open an issue <https://github.com/lmcinnes/pynndescent/issues/new>`_
+and I will try to provide any help and guidance that I can. Please also check
+the docstrings on the code, which provide some descriptions of the parameters.
+
+-------
+License
+-------
+
+The pynndescent package is 2-clause BSD licensed. Enjoy.
+
+------------
+Contributing
+------------
+
+Contributions are more than welcome! There are lots of opportunities
+for potential projects, so please get in touch if you would like to
+help out. Everything from code to notebooks to
+examples and documentation are all *equally valuable* so please don't feel
+you can't contribute. To contribute please `fork the project <https://github.com/lmcinnes/pynndescent/issues#fork-destination-box>`_ make your changes and
+submit a pull request. We will do our best to work through any issues with
+you and get your code merged into the main branch.
+
+


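Note on the PKG-INFO hunks above: the long description moves from an indented `Description:` header (Metadata-Version 1.2 layout) to the message body after the headers (allowed from Metadata-Version 2.1). A minimal sketch of reading both layouts with Python's standard `email` parser follows. The abbreviated `PKG_INFO` string and the helper name `read_metadata` are illustrative assumptions, not part of the package.

```python
from email import message_from_string

# Hypothetical, abbreviated PKG-INFO in the 2.1 layout seen in the diff:
# headers first, then a blank line, then the long description as the body.
PKG_INFO = """\
Metadata-Version: 2.1
Name: pynndescent
Version: 0.5.11
Summary: Nearest Neighbor Descent
License: BSD
License-File: LICENSE

===========
PyNNDescent
===========

PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.
"""


def read_metadata(text):
    """Return (headers, long_description) for a PKG-INFO string."""
    msg = message_from_string(text)
    # dict() keeps only one value for repeated fields such as Classifier;
    # use msg.get_all("Classifier") when every value is needed.
    headers = dict(msg.items())
    # Older layouts carry the text in a Description header,
    # newer ones may carry it in the message body instead.
    description = msg.get("Description") or msg.get_payload()
    return headers, description


headers, description = read_metadata(PKG_INFO)
print(headers["Metadata-Version"], headers["Version"])
print(description.splitlines()[1])
```

The same helper should also accept the older layout, since it checks the `Description` header before falling back to the body.
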
=====================================
README.rst
=====================================
@@ -3,15 +3,9 @@
   :align: center
   :alt: PyNNDescent Logo
 
-.. image:: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_apis/build/status/lmcinnes.pynndescent?branchName=master
-    :target: .. _build_status: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_build/latest?definitionId=2&branchName=master
+.. image:: https://dev.azure.com/TutteInstitute/build-pipelines/_apis/build/status%2Flmcinnes.pynndescent?branchName=master
+    :target: https://dev.azure.com/TutteInstitute/build-pipelines/_build?definitionId=17
     :alt: Azure Pipelines Build Status
-.. image:: https://img.shields.io/lgtm/alerts/g/lmcinnes/pynndescent.svg
-    :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/alerts
-    :alt: LGTM Alerts
-.. image:: https://img.shields.io/lgtm/grade/python/g/lmcinnes/pynndescent.svg
-    :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/context:python
-    :alt: LGTM Grade
 .. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
     :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
     :alt: Documentation Status


=====================================
debian/changelog
=====================================
@@ -1,3 +1,11 @@
+python-pynndescent (0.5.11-1) UNRELEASED; urgency=medium
+
+  * New upstream version
+  * Build-Depends: s/dh-python/dh-sequence-python3/ (routine-update)
+  TODO: bug #1057598  FTBFS: ImportError: Numba could not be imported.
+
+ -- Andreas Tille <tille at debian.org>  Thu, 14 Dec 2023 11:10:56 +0100
+
 python-pynndescent (0.5.8-2) unstable; urgency=medium
 
   * Team Upload.


=====================================
debian/control
=====================================
@@ -4,7 +4,7 @@ Uploaders: Andreas Tille <tille at debian.org>
 Section: python
 Priority: optional
 Build-Depends: debhelper-compat (= 13),
-               dh-python,
+               dh-sequence-python3,
                python3-setuptools,
                python3-all,
                python3-joblib <!nocheck>,


=====================================
debian/patches/arm.patch
=====================================
@@ -3,7 +3,7 @@ Author: Nilesh Patra <nilesh at debian.org>
 Last-Update: 2023-02-12
 --- a/pynndescent/tests/test_distances.py
 +++ b/pynndescent/tests/test_distances.py
-@@ -9,6 +9,7 @@
+@@ -9,6 +9,7 @@ from scipy.version import full_version a
  from sklearn.metrics import pairwise_distances
  from sklearn.neighbors import BallTree
  from sklearn.preprocessing import normalize
@@ -11,7 +11,7 @@ Last-Update: 2023-02-12
  
  
  @pytest.mark.parametrize(
-@@ -107,6 +108,9 @@
+@@ -106,6 +107,9 @@ def test_binary_check(binary_data, metri
      ],
  )
  def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
@@ -21,7 +21,7 @@ Last-Update: 2023-02-12
      if metric in spdist.sparse_named_distances:
          dist_matrix = pairwise_distances(
              np.asarray(sparse_spatial_data.todense()).astype(np.float32), metric=metric
-@@ -333,6 +337,9 @@
+@@ -331,6 +335,9 @@ def test_alternative_distances():
  
  
  def test_jensen_shannon():
@@ -31,7 +31,7 @@ Last-Update: 2023-02-12
      test_data = np.random.random(size=(10, 50))
      test_data = normalize(test_data, norm="l1")
      for i in range(test_data.shape[0]):
-@@ -349,6 +356,9 @@
+@@ -347,6 +354,9 @@ def test_jensen_shannon():
  
  
  def test_sparse_jensen_shannon():
@@ -43,7 +43,7 @@ Last-Update: 2023-02-12
      test_data[test_data <= 0.5] = 0.0
 --- a/pynndescent/tests/test_pynndescent_.py
 +++ b/pynndescent/tests/test_pynndescent_.py
-@@ -11,9 +11,13 @@
+@@ -11,9 +11,13 @@ from sklearn.preprocessing import normal
  import pickle
  import joblib
  import scipy


=====================================
debian/patches/scipy.patch deleted
=====================================
@@ -1,29 +0,0 @@
-From 00444be2107b71169b853847e7b334623c58a4e3 Mon Sep 17 00:00:00 2001
-From: Leland McInnes <leland.mcinnes at gmail.com>
-Date: Tue, 3 Jan 2023 10:53:15 -0500
-Subject: [PATCH] Update tests to fix #207
-
----
- pynndescent/tests/test_distances.py | 2 +-
- 1 file changed, 1 insertion(+), 1 deletion(-)
-
---- a/pynndescent/tests/test_distances.py
-+++ b/pynndescent/tests/test_distances.py
-@@ -109,7 +109,7 @@
- def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
-     if metric in spdist.sparse_named_distances:
-         dist_matrix = pairwise_distances(
--            sparse_spatial_data.todense().astype(np.float32), metric=metric
-+            np.asarray(sparse_spatial_data.todense()).astype(np.float32), metric=metric
-         )
-     if metric in ("braycurtis", "dice", "sokalsneath", "yule"):
-         dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
-@@ -174,7 +174,7 @@
- )
- def test_sparse_binary_check(sparse_binary_data, metric):
-     if metric in spdist.sparse_named_distances:
--        dist_matrix = pairwise_distances(sparse_binary_data.todense(), metric=metric)
-+        dist_matrix = pairwise_distances(np.asarray(sparse_binary_data.todense()), metric=metric)
-     if metric in ("jaccard", "dice", "sokalsneath"):
-         dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
-     if metric in ("kulsinski", "russellrao"):


=====================================
debian/patches/series
=====================================
@@ -1,2 +1 @@
-scipy.patch
 arm.patch


=====================================
debian/rules
=====================================
@@ -4,7 +4,7 @@ export PYBUILD_NAME=pynndescent
 export PYTHONPATH=$(CURDIR)
 
 %:
-	dh $@ --with python3 --buildsystem=pybuild
+	dh $@ --buildsystem=pybuild
 
 override_dh_auto_test:
 ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS)))


=====================================
pynndescent.egg-info/PKG-INFO
=====================================
@@ -1,6 +1,6 @@
-Metadata-Version: 1.2
+Metadata-Version: 2.1
 Name: pynndescent
-Version: 0.5.8
+Version: 0.5.11
 Summary: Nearest Neighbor Descent
 Home-page: http://github.com/lmcinnes/pynndescent
 Author: Leland McInnes
@@ -8,219 +8,11 @@ Author-email: leland.mcinnes at gmail.com
 Maintainer: Leland McInnes
 Maintainer-email: leland.mcinnes at gmail.com
 License: BSD
-Description: .. image:: doc/pynndescent_logo.png
-          :width: 600
-          :align: center
-          :alt: PyNNDescent Logo
-        
-        .. image:: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_apis/build/status/lmcinnes.pynndescent?branchName=master
-            :target: .. _build_status: https://dev.azure.com/lelandmcinnes/UMAP%20project%20builds/_build/latest?definitionId=2&branchName=master
-            :alt: Azure Pipelines Build Status
-        .. image:: https://img.shields.io/lgtm/alerts/g/lmcinnes/pynndescent.svg
-            :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/alerts
-            :alt: LGTM Alerts
-        .. image:: https://img.shields.io/lgtm/grade/python/g/lmcinnes/pynndescent.svg
-            :target: https://lgtm.com/projects/g/lmcinnes/pynndescent/context:python
-            :alt: LGTM Grade
-        .. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
-            :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
-            :alt: Documentation Status
-        
-        ===========
-        PyNNDescent
-        ===========
-        
-        PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.
-        It provides a python implementation of Nearest Neighbor
-        Descent for k-neighbor-graph construction and approximate nearest neighbor
-        search, as per the paper:
-        
-        Dong, Wei, Charikar Moses, and Kai Li.
-        *"Efficient k-nearest neighbor graph construction for generic similarity
-        measures."*
-        Proceedings of the 20th international conference on World wide web. ACM, 2011.
-        
-        This library supplements that approach with the use of random projection trees for
-        initialisation. This can be particularly useful for the metrics that are
-        amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph
-        diversification is also performed, pruning the longest edges of any triangles in the
-        graph.
-        
-        Currently this library targets relatively high accuracy 
-        (80%-100% accuracy rate) approximate nearest neighbor searches.
-        
-        --------------------
-        Why use PyNNDescent?
-        --------------------
-        
-        PyNNDescent provides fast approximate nearest neighbor queries. The
-        `ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
-        solidly in the mix of top performing ANN libraries:
-        
-        **SIFT-128 Euclidean**
-        
-        .. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
-            :alt: ANN benchmark performance for SIFT 128 dataset
-        
-        **NYTimes-256 Angular**
-        
-        .. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
-            :alt: ANN benchmark performance for NYTimes 256 dataset
-        
-        While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
-        and conda installable) with no platform or compilation issues, and is very flexible,
-        supporting a wide variety of distance metrics by default:
-        
-        **Minkowski style metrics**
-        
-        - euclidean
-        - manhattan
-        - chebyshev
-        - minkowski
-        
-        **Miscellaneous spatial metrics**
-        
-        - canberra
-        - braycurtis
-        - haversine
-        
-        **Normalized spatial metrics**
-        
-        - mahalanobis
-        - wminkowski
-        - seuclidean
-        
-        **Angular and correlation metrics**
-        
-        - cosine
-        - dot
-        - correlation
-        - spearmanr
-        - tsss
-        - true_angular
-        
-        **Probability metrics**
-        
-        - hellinger
-        - wasserstein
-        
-        **Metrics for binary data**
-        
-        - hamming
-        - jaccard
-        - dice
-        - russelrao
-        - kulsinski
-        - rogerstanimoto
-        - sokalmichener
-        - sokalsneath
-        - yule
-        
-        and also custom user defined distance metrics while still retaining performance.
-        
-        PyNNDescent also integrates well with Scikit-learn, including providing support
-        for the KNeighborTransformer as a drop in replacement for algorithms
-        that make use of nearest neighbor computations.
-        
-        ----------------------
-        How to use PyNNDescent
-        ----------------------
-        
-        PyNNDescent aims to have a very simple interface. It is similar to (but more
-        limited than) KDTrees and BallTrees in ``sklearn``. In practice there are
-        only two operations -- index construction, and querying an index for nearest
-        neighbors.
-        
-        To build a new search index on some training data ``data`` you can do something
-        like
-        
-        .. code:: python
-        
-            from pynndescent import NNDescent
-            index = NNDescent(data)
-        
-        You can then use the index for searching (and can pickle it to disk if you
-        wish). To search a pynndescent index for the 15 nearest neighbors of a test data
-        set ``query_data`` you can do something like
-        
-        .. code:: python
-        
-            index.query(query_data, k=15)
-        
-        and that is pretty much all there is to it. You can find more details in the
-        `documentation <https://pynndescent.readthedocs.org>`_.
-        
-        ----------
-        Installing
-        ----------
-        
-        PyNNDescent is designed to be easy to install being a pure python module with
-        relatively light requirements:
-        
-        * numpy
-        * scipy
-        * scikit-learn >= 0.22
-        * numba >= 0.51
-        
-        all of which should be pip or conda installable. The easiest way to install should be
-        via conda:
-        
-        .. code:: bash
-        
-            conda install -c conda-forge pynndescent
-        
-        or via pip:
-        
-        .. code:: bash
-        
-            pip install pynndescent
-        
-        To manually install this package:
-        
-        .. code:: bash
-        
-            wget https://github.com/lmcinnes/pynndescent/archive/master.zip
-            unzip master.zip
-            rm master.zip
-            cd pynndescent-master
-            python setup.py install
-        
-        ----------------
-        Help and Support
-        ----------------
-        
-        This project is still young. The documentation is still growing. In the meantime please
-        `open an issue <https://github.com/lmcinnes/pynndescent/issues/new>`_
-        and I will try to provide any help and guidance that I can. Please also check
-        the docstrings on the code, which provide some descriptions of the parameters.
-        
-        -------
-        License
-        -------
-        
-        The pynndescent package is 2-clause BSD licensed. Enjoy.
-        
-        ------------
-        Contributing
-        ------------
-        
-        Contributions are more than welcome! There are lots of opportunities
-        for potential projects, so please get in touch if you would like to
-        help out. Everything from code to notebooks to
-        examples and documentation are all *equally valuable* so please don't feel
-        you can't contribute. To contribute please `fork the project <https://github.com/lmcinnes/pynndescent/issues#fork-destination-box>`_ make your changes and
-        submit a pull request. We will do our best to work through any issues with
-        you and get your code merged into the main branch.
-        
-        
-        
 Keywords: nearest neighbor,knn,ANN
-Platform: UNKNOWN
 Classifier: Development Status :: 3 - Alpha
 Classifier: Intended Audience :: Science/Research
 Classifier: Intended Audience :: Developers
 Classifier: License :: OSI Approved
-Classifier: Programming Language :: C
 Classifier: Programming Language :: Python
 Classifier: Topic :: Software Development
 Classifier: Topic :: Scientific/Engineering
@@ -231,3 +23,204 @@ Classifier: Operating System :: MacOS
 Classifier: Programming Language :: Python :: 3.6
 Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: 3.8
+License-File: LICENSE
+
+.. image:: doc/pynndescent_logo.png
+  :width: 600
+  :align: center
+  :alt: PyNNDescent Logo
+
+.. image:: https://dev.azure.com/TutteInstitute/build-pipelines/_apis/build/status%2Flmcinnes.pynndescent?branchName=master
+    :target: https://dev.azure.com/TutteInstitute/build-pipelines/_build?definitionId=17
+    :alt: Azure Pipelines Build Status
+.. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
+    :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
+    :alt: Documentation Status
+
+===========
+PyNNDescent
+===========
+
+PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.
+It provides a python implementation of Nearest Neighbor
+Descent for k-neighbor-graph construction and approximate nearest neighbor
+search, as per the paper:
+
+Dong, Wei, Charikar Moses, and Kai Li.
+*"Efficient k-nearest neighbor graph construction for generic similarity
+measures."*
+Proceedings of the 20th international conference on World wide web. ACM, 2011.
+
+This library supplements that approach with the use of random projection trees for
+initialisation. This can be particularly useful for the metrics that are
+amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph
+diversification is also performed, pruning the longest edges of any triangles in the
+graph.
+
+Currently this library targets relatively high accuracy 
+(80%-100% accuracy rate) approximate nearest neighbor searches.
+
+--------------------
+Why use PyNNDescent?
+--------------------
+
+PyNNDescent provides fast approximate nearest neighbor queries. The
+`ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
+solidly in the mix of top performing ANN libraries:
+
+**SIFT-128 Euclidean**
+
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
+    :alt: ANN benchmark performance for SIFT 128 dataset
+
+**NYTimes-256 Angular**
+
+.. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
+    :alt: ANN benchmark performance for NYTimes 256 dataset
+
+While PyNNDescent is among fastest ANN library, it is also both easy to install (pip
+and conda installable) with no platform or compilation issues, and is very flexible,
+supporting a wide variety of distance metrics by default:
+
+**Minkowski style metrics**
+
+- euclidean
+- manhattan
+- chebyshev
+- minkowski
+
+**Miscellaneous spatial metrics**
+
+- canberra
+- braycurtis
+- haversine
+
+**Normalized spatial metrics**
+
+- mahalanobis
+- wminkowski
+- seuclidean
+
+**Angular and correlation metrics**
+
+- cosine
+- dot
+- correlation
+- spearmanr
+- tsss
+- true_angular
+
+**Probability metrics**
+
+- hellinger
+- wasserstein
+
+**Metrics for binary data**
+
+- hamming
+- jaccard
+- dice
+- russelrao
+- kulsinski
+- rogerstanimoto
+- sokalmichener
+- sokalsneath
+- yule
+
+and also custom user defined distance metrics while still retaining performance.
+
+PyNNDescent also integrates well with Scikit-learn, including providing support
+for the KNeighborTransformer as a drop in replacement for algorithms
+that make use of nearest neighbor computations.
+
+----------------------
+How to use PyNNDescent
+----------------------
+
+PyNNDescent aims to have a very simple interface. It is similar to (but more
+limited than) KDTrees and BallTrees in ``sklearn``. In practice there are
+only two operations -- index construction, and querying an index for nearest
+neighbors.
+
+To build a new search index on some training data ``data`` you can do something
+like
+
+.. code:: python
+
+    from pynndescent import NNDescent
+    index = NNDescent(data)
+
+You can then use the index for searching (and can pickle it to disk if you
+wish). To search a pynndescent index for the 15 nearest neighbors of a test data
+set ``query_data`` you can do something like
+
+.. code:: python
+
+    index.query(query_data, k=15)
+
+and that is pretty much all there is to it. You can find more details in the
+`documentation <https://pynndescent.readthedocs.org>`_.
+
+----------
+Installing
+----------
+
+PyNNDescent is designed to be easy to install being a pure python module with
+relatively light requirements:
+
+* numpy
+* scipy
+* scikit-learn >= 0.22
+* numba >= 0.51
+
+all of which should be pip or conda installable. The easiest way to install should be
+via conda:
+
+.. code:: bash
+
+    conda install -c conda-forge pynndescent
+
+or via pip:
+
+.. code:: bash
+
+    pip install pynndescent
+
+To manually install this package:
+
+.. code:: bash
+
+    wget https://github.com/lmcinnes/pynndescent/archive/master.zip
+    unzip master.zip
+    rm master.zip
+    cd pynndescent-master
+    python setup.py install
+
+----------------
+Help and Support
+----------------
+
+This project is still young. The documentation is still growing. In the meantime please
+`open an issue <https://github.com/lmcinnes/pynndescent/issues/new>`_
+and I will try to provide any help and guidance that I can. Please also check
+the docstrings on the code, which provide some descriptions of the parameters.
+
+-------
+License
+-------
+
+The pynndescent package is 2-clause BSD licensed. Enjoy.
+
+------------
+Contributing
+------------
+
+Contributions are more than welcome! There are lots of opportunities
+for potential projects, so please get in touch if you would like to
+help out. Everything from code to notebooks to
+examples and documentation are all *equally valuable* so please don't feel
+you can't contribute. To contribute please `fork the project <https://github.com/lmcinnes/pynndescent/issues#fork-destination-box>`_ make your changes and
+submit a pull request. We will do our best to work through any issues with
+you and get your code merged into the main branch.
+
+


=====================================
pynndescent.egg-info/SOURCES.txt
=====================================
@@ -26,4 +26,6 @@ pynndescent/tests/conftest.py
 pynndescent/tests/test_distances.py
 pynndescent/tests/test_pynndescent_.py
 pynndescent/tests/test_rank.py
-pynndescent/tests/test_data/cosine_hang.npy
\ No newline at end of file
+pynndescent/tests/test_data/cosine_hang.npy
+pynndescent/tests/test_data/cosine_near_duplicates.npy
+pynndescent/tests/test_data/pynndescent_bug_np.npz
\ No newline at end of file


=====================================
pynndescent/distances.py
=====================================
@@ -695,15 +695,10 @@ def rankdata(a, method="average"):
 
 @numba.njit(fastmath=True)
 def spearmanr(x, y):
-    a = np.column_stack((x, y))
+    x_rank = rankdata(x)
+    y_rank = rankdata(y)
 
-    n_vars = a.shape[1]
-
-    for i in range(n_vars):
-        a[:, i] = rankdata(a[:, i])
-    rs = np.corrcoef(a, rowvar=0)
-
-    return rs[1, 0]
+    return correlation(x_rank, y_rank)
 
 
 @numba.njit(nogil=True)
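
Note on the spearmanr hunk above: the new body ranks each vector and returns the correlation distance of the ranks, so the function yields 1 minus Spearman's rho rather than rho itself (the updated assertion in test_distances.py checks this). A minimal pure-Python sketch of that relationship follows. The names average_ranks, correlation_distance and spearman_distance are hypothetical stand-ins, not the numba-compiled upstream functions, and the zero-variance handling is an assumption.

```python
import math


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def correlation_distance(x, y):
    """Return 1 - Pearson correlation (assumed handling of constant inputs)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 and var_y == 0.0:
        return 0.0
    if num == 0.0:
        return 1.0
    return 1.0 - num / math.sqrt(var_x * var_y)


def spearman_distance(x, y):
    """Correlation distance computed on ranks, i.e. 1 - Spearman's rho."""
    return correlation_distance(average_ranks(x), average_ranks(y))
```

Under this convention identical orderings give a distance near 0 and fully reversed orderings give a distance near 2, which fits the distance semantics of the other metrics better than a raw correlation coefficient would.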


=====================================
pynndescent/pynndescent_.py
=====================================
@@ -93,7 +93,7 @@ def generate_leaf_updates(leaf_block, dist_thresholds, data, dist):
     return updates
 
 
-@numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=True)
+@numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=False)
 def init_rp_tree(data, dist, current_graph, leaf_array):
 
     n_leaves = leaf_array.shape[0]
@@ -137,7 +137,7 @@ def init_rp_tree(data, dist, current_graph, leaf_array):
 @numba.njit(
     fastmath=True,
     locals={"d": numba.float32, "idx": numba.int32, "i": numba.int32},
-    cache=True,
+    cache=False,
 )
 def init_random(n_neighbors, data, heap, dist, rng_state):
     for i in range(data.shape[0]):
@@ -199,7 +199,7 @@ def generate_graph_updates(
     return updates
 
 
-@numba.njit(cache=True)
+@numba.njit(cache=False)
 def process_candidates(
     data,
     dist,
@@ -630,6 +630,13 @@ class NNDescent:
         non-negligible computation cost in building the index. Don't tweak
         this value unless you know what you're doing.
 
+    max_rptree_depth: int (optional, default=100)
+        Maximum depth of random projection trees. Increasing this may result in a
+        richer, deeper random projection forest, but it may be composed of many
+        degenerate branches. Increase leaf_size in order to keep shallower, wider
+        nondegenerate trees. Such wide trees, however, may yield poor performance
+        of the preparation of the NN descent.
+
     n_iters: int (optional, default=None)
         The maximum number of NN-descent iterations to perform. The
         NN-descent algorithm can abort early if limited progress is being
@@ -679,6 +686,7 @@ class NNDescent:
         random_state=None,
         low_memory=True,
         max_candidates=None,
+        max_rptree_depth=200,
         n_iters=None,
         delta=0.001,
         n_jobs=None,
@@ -702,6 +710,7 @@ class NNDescent:
         self.prune_degree_multiplier = pruning_degree_multiplier
         self.diversify_prob = diversify_prob
         self.n_search_trees = n_search_trees
+        self.max_rptree_depth = max_rptree_depth
         self.max_candidates = max_candidates
         self.low_memory = low_memory
         self.n_iters = n_iters
@@ -776,6 +785,7 @@ class NNDescent:
 
         if metric == "dot":
             data = normalize(data, norm="l2", copy=copy_on_normalize)
+            self._raw_data = data
 
         self.rng_state = current_random_state.randint(INT32_MIN, INT32_MAX, 3).astype(
             np.int64
@@ -799,6 +809,7 @@ class NNDescent:
                 current_random_state,
                 self.n_jobs,
                 self._angular_trees,
+                max_depth=self.max_rptree_depth,
             )
             leaf_array = rptree_leaf_array(self._rp_forest)
         else:
@@ -988,6 +999,7 @@ class NNDescent:
                         current_random_state,
                         self.n_jobs,
                         self._angular_trees,
+                        max_depth=self.max_rptree_depth,
                     )
                     self._search_forest = [
                         convert_tree_format(
@@ -1838,6 +1850,7 @@ class NNDescent:
                 current_random_state,
                 self.n_jobs,
                 self._angular_trees,
+                max_depth=self.max_rptree_depth,
             )
             leaf_array = rptree_leaf_array(self._rp_forest)
             current_graph = make_heap(self._raw_data.shape[0], self.n_neighbors)


=====================================
pynndescent/rp_trees.py
=====================================
@@ -141,6 +141,18 @@ def angular_random_projection_split(data, indices, rng_state):
             side[i] = 1
             n_right += 1
 
+    # If all points end up on one side, something went wrong numerically
+    # In this case, assign points randomly; they are likely very close anyway
+    if n_left == 0 or n_right == 0:
+        n_left = 0
+        n_right = 0
+        for i in range(indices.shape[0]):
+            side[i] = tau_rand_int(rng_state) % 2
+            if side[i] == 0:
+                n_left += 1
+            else:
+                n_right += 1
+
     # Now that we have the counts allocate arrays
     indices_left = np.empty(n_left, dtype=np.int32)
     indices_right = np.empty(n_right, dtype=np.int32)
@@ -248,6 +260,18 @@ def euclidean_random_projection_split(data, indices, rng_state):
             side[i] = 1
             n_right += 1
 
+    # If all points end up on one side, something went wrong numerically
+    # In this case, assign points randomly; they are likely very close anyway
+    if n_left == 0 or n_right == 0:
+        n_left = 0
+        n_right = 0
+        for i in range(indices.shape[0]):
+            side[i] = tau_rand_int(rng_state) % 2
+            if side[i] == 0:
+                n_left += 1
+            else:
+                n_right += 1
+
     # Now that we have the counts allocate arrays
     indices_left = np.empty(n_left, dtype=np.int32)
     indices_right = np.empty(n_right, dtype=np.int32)
@@ -372,6 +396,18 @@ def sparse_angular_random_projection_split(inds, indptr, data, indices, rng_stat
             side[i] = 1
             n_right += 1
 
+    # If all points end up on one side, something went wrong numerically
+    # In this case, assign points randomly; they are likely very close anyway
+    if n_left == 0 or n_right == 0:
+        n_left = 0
+        n_right = 0
+        for i in range(indices.shape[0]):
+            side[i] = tau_rand_int(rng_state) % 2
+            if side[i] == 0:
+                n_left += 1
+            else:
+                n_right += 1
+
     # Now that we have the counts allocate arrays
     indices_left = np.empty(n_left, dtype=np.int32)
     indices_right = np.empty(n_right, dtype=np.int32)
@@ -479,6 +515,18 @@ def sparse_euclidean_random_projection_split(inds, indptr, data, indices, rng_st
             side[i] = 1
             n_right += 1
 
+    # If all points end up on one side, something went wrong numerically
+    # In this case, assign points randomly; they are likely very close anyway
+    if n_left == 0 or n_right == 0:
+        n_left = 0
+        n_right = 0
+        for i in range(indices.shape[0]):
+            side[i] = abs(tau_rand_int(rng_state)) % 2
+            if side[i] == 0:
+                n_left += 1
+            else:
+                n_right += 1
+
     # Now that we have the counts allocate arrays
     indices_left = np.empty(n_left, dtype=np.int32)
     indices_right = np.empty(n_right, dtype=np.int32)
@@ -501,7 +549,6 @@ def sparse_euclidean_random_projection_split(inds, indptr, data, indices, rng_st
 
 @numba.njit(
     nogil=True,
-    cache=True,
     locals={"left_node_num": numba.types.int32, "right_node_num": numba.types.int32},
 )
 def make_euclidean_tree(
@@ -513,8 +560,9 @@ def make_euclidean_tree(
     point_indices,
     rng_state,
     leaf_size=30,
+    max_depth=200,
 ):
-    if indices.shape[0] > leaf_size:
+    if indices.shape[0] > leaf_size and max_depth > 0:
         (
             left_indices,
             right_indices,
@@ -531,6 +579,7 @@ def make_euclidean_tree(
             point_indices,
             rng_state,
             leaf_size,
+            max_depth - 1,
         )
 
         left_node_num = len(point_indices) - 1
@@ -544,6 +593,7 @@ def make_euclidean_tree(
             point_indices,
             rng_state,
             leaf_size,
+            max_depth - 1,
         )
 
         right_node_num = len(point_indices) - 1
@@ -563,7 +613,6 @@ def make_euclidean_tree(
 
 @numba.njit(
     nogil=True,
-    cache=True,
     locals={
         "children": numba.types.ListType(children_type),
         "left_node_num": numba.types.int32,
@@ -579,8 +628,9 @@ def make_angular_tree(
     point_indices,
     rng_state,
     leaf_size=30,
+    max_depth=200,
 ):
-    if indices.shape[0] > leaf_size:
+    if indices.shape[0] > leaf_size and max_depth > 0:
         (
             left_indices,
             right_indices,
@@ -597,6 +647,7 @@ def make_angular_tree(
             point_indices,
             rng_state,
             leaf_size,
+            max_depth - 1,
         )
 
         left_node_num = len(point_indices) - 1
@@ -610,6 +661,7 @@ def make_angular_tree(
             point_indices,
             rng_state,
             leaf_size,
+            max_depth - 1,
         )
 
         right_node_num = len(point_indices) - 1
@@ -629,7 +681,6 @@ def make_angular_tree(
 
 @numba.njit(
     nogil=True,
-    cache=True,
     locals={"left_node_num": numba.types.int32, "right_node_num": numba.types.int32},
 )
 def make_sparse_euclidean_tree(
@@ -643,8 +694,9 @@ def make_sparse_euclidean_tree(
     point_indices,
     rng_state,
     leaf_size=30,
+    max_depth=200,
 ):
-    if indices.shape[0] > leaf_size:
+    if indices.shape[0] > leaf_size and max_depth > 0:
         (
             left_indices,
             right_indices,
@@ -665,6 +717,7 @@ def make_sparse_euclidean_tree(
             point_indices,
             rng_state,
             leaf_size,
+            max_depth - 1,
         )
 
         left_node_num = len(point_indices) - 1
@@ -680,6 +733,7 @@ def make_sparse_euclidean_tree(
             point_indices,
             rng_state,
             leaf_size,
+            max_depth - 1,
         )
 
         right_node_num = len(point_indices) - 1
@@ -699,7 +753,6 @@ def make_sparse_euclidean_tree(
 
 @numba.njit(
     nogil=True,
-    cache=True,
     locals={"left_node_num": numba.types.int32, "right_node_num": numba.types.int32},
 )
 def make_sparse_angular_tree(
@@ -713,8 +766,9 @@ def make_sparse_angular_tree(
     point_indices,
     rng_state,
     leaf_size=30,
+    max_depth=200,
 ):
-    if indices.shape[0] > leaf_size:
+    if indices.shape[0] > leaf_size and max_depth > 0:
         (
             left_indices,
             right_indices,
@@ -735,6 +789,7 @@ def make_sparse_angular_tree(
             point_indices,
             rng_state,
             leaf_size,
+            max_depth - 1,
         )
 
         left_node_num = len(point_indices) - 1
@@ -750,6 +805,7 @@ def make_sparse_angular_tree(
             point_indices,
             rng_state,
             leaf_size,
+            max_depth - 1,
         )
 
         right_node_num = len(point_indices) - 1
@@ -765,8 +821,8 @@ def make_sparse_angular_tree(
         point_indices.append(indices)
 
 
-@numba.njit(nogil=True, cache=True)
-def make_dense_tree(data, rng_state, leaf_size=30, angular=False):
+@numba.njit(nogil=True)
+def make_dense_tree(data, rng_state, leaf_size=30, angular=False, max_depth=200):
     indices = np.arange(data.shape[0]).astype(np.int32)
 
     hyperplanes = numba.typed.List.empty_list(dense_hyperplane_type)
@@ -784,6 +840,7 @@ def make_dense_tree(data, rng_state, leaf_size=30, angular=False):
             point_indices,
             rng_state,
             leaf_size,
+            max_depth=max_depth,
         )
     else:
         make_euclidean_tree(
@@ -795,14 +852,28 @@ def make_dense_tree(data, rng_state, leaf_size=30, angular=False):
             point_indices,
             rng_state,
             leaf_size,
+            max_depth=max_depth,
         )
 
-    result = FlatTree(hyperplanes, offsets, children, point_indices, leaf_size)
+    max_leaf_size = leaf_size
+    for points in point_indices:
+        if len(points) > max_leaf_size:
+            max_leaf_size = numba.int32(len(points))
+
+    result = FlatTree(hyperplanes, offsets, children, point_indices, max_leaf_size)
     return result
 
 
-@numba.njit(nogil=True, cache=True)
-def make_sparse_tree(inds, indptr, spdata, rng_state, leaf_size=30, angular=False):
+@numba.njit(nogil=True)
+def make_sparse_tree(
+    inds,
+    indptr,
+    spdata,
+    rng_state,
+    leaf_size=30,
+    angular=False,
+    max_depth=200,
+):
     indices = np.arange(indptr.shape[0] - 1).astype(np.int32)
 
     hyperplanes = numba.typed.List.empty_list(sparse_hyperplane_type)
@@ -822,6 +893,7 @@ def make_sparse_tree(inds, indptr, spdata, rng_state, leaf_size=30, angular=Fals
             point_indices,
             rng_state,
             leaf_size,
+            max_depth=max_depth,
         )
     else:
         make_sparse_euclidean_tree(
@@ -835,9 +907,15 @@ def make_sparse_tree(inds, indptr, spdata, rng_state, leaf_size=30, angular=Fals
             point_indices,
             rng_state,
             leaf_size,
+            max_depth=max_depth,
         )
 
-    return FlatTree(hyperplanes, offsets, children, point_indices, leaf_size)
+    max_leaf_size = leaf_size
+    for points in point_indices:
+        if len(points) > max_leaf_size:
+            max_leaf_size = numba.int32(len(points))
+
+    return FlatTree(hyperplanes, offsets, children, point_indices, max_leaf_size)
 
 
 @numba.njit(
@@ -956,6 +1034,7 @@ def make_forest(
     random_state,
     n_jobs=None,
     angular=False,
+    max_depth=200,
 ):
     """Build a random projection forest with ``n_trees``.
 
@@ -993,12 +1072,19 @@ def make_forest(
                     rng_states[i],
                     leaf_size,
                     angular,
+                    max_depth=max_depth,
                 )
                 for i in range(n_trees)
             )
         else:
             result = joblib.Parallel(n_jobs=n_jobs, require="sharedmem")(
-                joblib.delayed(make_dense_tree)(data, rng_states[i], leaf_size, angular)
+                joblib.delayed(make_dense_tree)(
+                    data,
+                    rng_states[i],
+                    leaf_size,
+                    angular,
+                    max_depth=max_depth
+                )
                 for i in range(n_trees)
             )
     except (RuntimeError, RecursionError, SystemError):
@@ -1011,14 +1097,14 @@ def make_forest(
     return tuple(result)
 
 
-@numba.njit(nogil=True, cache=True)
-def get_leaves_from_tree(tree):
+@numba.njit(nogil=True)
+def get_leaves_from_tree(tree, max_leaf_size):
     n_leaves = 0
     for i in range(len(tree.children)):
         if tree.children[i][0] == -1 and tree.children[i][1] == -1:
             n_leaves += 1
 
-    result = np.full((n_leaves, tree.leaf_size), -1, dtype=np.int32)
+    result = np.full((n_leaves, max_leaf_size), -1, dtype=np.int32)
     leaf_index = 0
     for i in range(len(tree.indices)):
         if tree.children[i][0] == -1 or tree.children[i][1] == -1:
@@ -1030,8 +1116,9 @@ def get_leaves_from_tree(tree):
 
 
 def rptree_leaf_array_parallel(rp_forest):
+    max_leaf_size = np.max([rp_tree.leaf_size for rp_tree in rp_forest])
     result = joblib.Parallel(n_jobs=-1, require="sharedmem")(
-        joblib.delayed(get_leaves_from_tree)(rp_tree) for rp_tree in rp_forest
+        joblib.delayed(get_leaves_from_tree)(rp_tree, max_leaf_size) for rp_tree in rp_forest
     )
     return result
 
@@ -1047,7 +1134,6 @@ def rptree_leaf_array(rp_forest):
 def recursive_convert(
     tree, hyperplanes, offsets, children, indices, node_num, leaf_start, tree_node
 ):
-
     if tree.children[tree_node][0] < 0:
         leaf_end = leaf_start + len(tree.indices[tree_node])
         children[node_num, 0] = -leaf_start
@@ -1087,7 +1173,6 @@ def recursive_convert(
 def recursive_convert_sparse(
     tree, hyperplanes, offsets, children, indices, node_num, leaf_start, tree_node
 ):
-
     if tree.children[tree_node][0] < 0:
         leaf_end = leaf_start + len(tree.indices[tree_node])
         children[node_num, 0] = -leaf_start
@@ -1176,7 +1261,6 @@ FLAT_TREE_LEAF_SIZE = 4
 
 
 def denumbaify_tree(tree):
-
     result = (
         tree.hyperplanes,
         tree.offsets,
@@ -1189,7 +1273,6 @@ def denumbaify_tree(tree):
 
 
 def renumbaify_tree(tree):
-
     result = FlatTree(
         tree[FLAT_TREE_HYPERPLANES],
         tree[FLAT_TREE_OFFSETS],
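
Note on the rp_trees.py hunks above: two guards work together. A split that sends every point to one side is replaced by a random assignment, and recursion stops once max_depth reaches zero, so a leaf may end up holding more than leaf_size points; that appears to be why the largest leaf is measured afterwards and passed to FlatTree. A pure-Python sketch under stated assumptions: side_of is a hypothetical callback standing in for the hyperplane test, and Python's random module stands in for tau_rand_int.

```python
import random


def split_indices(indices, side_of, rng):
    """Split by side_of (0 = left, 1 = right); fall back to random sides."""
    left = [i for i in indices if side_of(i) == 0]
    right = [i for i in indices if side_of(i) == 1]
    if not left or not right:
        # Degenerate split (e.g. near-duplicate points): assign sides at random.
        left, right = [], []
        for i in indices:
            (left if rng.randrange(2) == 0 else right).append(i)
    return left, right


def build_leaves(indices, side_of, rng, leaf_size=30, max_depth=200, leaves=None):
    """Recursively split until a node is small enough or the depth budget is spent."""
    if leaves is None:
        leaves = []
    if len(indices) > leaf_size and max_depth > 0:
        left, right = split_indices(indices, side_of, rng)
        build_leaves(left, side_of, rng, leaf_size, max_depth - 1, leaves)
        build_leaves(right, side_of, rng, leaf_size, max_depth - 1, leaves)
    else:
        # Either small enough or out of depth: this node becomes a leaf.
        leaves.append(indices)
    return leaves


# Every point lands on side 0, mimicking a numerically degenerate split.
rng = random.Random(0)
leaves = build_leaves(list(range(100)), lambda i: 0, rng, leaf_size=10, max_depth=3)

# With a depth budget of 3 there are at most 8 leaves, so some leaf must
# exceed leaf_size; the leaf array therefore needs the measured maximum.
max_leaf_size = max(10, max(len(leaf) for leaf in leaves))
```

Sizing the leaf array from the measured maximum instead of the nominal leaf_size keeps oversized leaves from overflowing their rows.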


=====================================
pynndescent/sparse_nndescent.py
=====================================
@@ -53,7 +53,7 @@ def generate_leaf_updates(leaf_block, dist_thresholds, inds, indptr, data, dist)
     return updates
 
 
-@numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=True)
+@numba.njit(locals={"d": numba.float32, "p": numba.int32, "q": numba.int32}, cache=False)
 def init_rp_tree(inds, indptr, data, dist, current_graph, leaf_array):
 
     n_leaves = leaf_array.shape[0]
@@ -99,7 +99,7 @@ def init_rp_tree(inds, indptr, data, dist, current_graph, leaf_array):
 @numba.njit(
     fastmath=True,
     locals={"d": numba.float32, "i": numba.int32, "idx": numba.int32},
-    cache=True,
+    cache=False,
 )
 def init_random(n_neighbors, inds, indptr, data, heap, dist, rng_state):
     n_samples = indptr.shape[0] - 1


=====================================
pynndescent/tests/conftest.py
=====================================
@@ -63,6 +63,13 @@ def cosine_hang_data():
     return np.load(data_path)
 
 
+@pytest.fixture
+def cosine_near_duplicates_data():
+    this_dir = os.path.dirname(os.path.abspath(__file__))
+    data_path = os.path.join(this_dir, "test_data/cosine_near_duplicates.npy")
+    return np.load(data_path)
+
+
 @pytest.fixture
 def small_data():
     return np.random.uniform(40, 5, size=(20, 5))


=====================================
pynndescent/tests/test_data/cosine_near_duplicates.npy
=====================================
Binary files /dev/null and b/pynndescent/tests/test_data/cosine_near_duplicates.npy differ


=====================================
pynndescent/tests/test_data/pynndescent_bug_np.npz
=====================================
Binary files /dev/null and b/pynndescent/tests/test_data/pynndescent_bug_np.npz differ


=====================================
pynndescent/tests/test_distances.py
=====================================
@@ -58,7 +58,6 @@ def test_spatial_check(spatial_data, metric):
         "jaccard",
         "matching",
         "dice",
-        "kulsinski",
         "rogerstanimoto",
         "russellrao",
         "sokalmichener",
@@ -70,7 +69,7 @@ def test_binary_check(binary_data, metric):
     dist_matrix = pairwise_distances(binary_data, metric=metric)
     if metric in ("jaccard", "dice", "sokalsneath", "yule"):
         dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
-    if metric in ("kulsinski", "russellrao"):
+    if metric == "russellrao":
         dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
         # And because distance between all zero vectors should be zero
         dist_matrix[10, 11] = 0.0
@@ -109,11 +108,11 @@ def test_binary_check(binary_data, metric):
 def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
     if metric in spdist.sparse_named_distances:
         dist_matrix = pairwise_distances(
-            sparse_spatial_data.todense().astype(np.float32), metric=metric
+            np.asarray(sparse_spatial_data.todense()).astype(np.float32), metric=metric
         )
     if metric in ("braycurtis", "dice", "sokalsneath", "yule"):
         dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
-    if metric in ("cosine", "correlation", "kulsinski", "russellrao"):
+    if metric in ("cosine", "correlation", "russellrao"):
         dist_matrix[np.where(~np.isfinite(dist_matrix))] = 1.0
         # And because distance between all zero vectors should be zero
         dist_matrix[10, 11] = 0.0
@@ -165,7 +164,6 @@ def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
         "jaccard",
         "matching",
         "dice",
-        "kulsinski",
         "rogerstanimoto",
         "russellrao",
         "sokalmichener",
@@ -174,10 +172,10 @@ def test_sparse_spatial_check(sparse_spatial_data, metric, decimal=6):
 )
 def test_sparse_binary_check(sparse_binary_data, metric):
     if metric in spdist.sparse_named_distances:
-        dist_matrix = pairwise_distances(sparse_binary_data.todense(), metric=metric)
+        dist_matrix = pairwise_distances(np.asarray(sparse_binary_data.todense()), metric=metric)
     if metric in ("jaccard", "dice", "sokalsneath"):
         dist_matrix[np.where(~np.isfinite(dist_matrix))] = 0.0
-    if metric in ("kulsinski", "russellrao"):
+    if metric == "russellrao":
         dist_matrix[np.where(~np.isfinite(dist_matrix))] = 1.0
         # And because distance between all zero vectors should be zero
         dist_matrix[10, 11] = 0.0
@@ -309,7 +307,7 @@ def test_spearmanr():
 
     scipy_expected = stats.spearmanr(x, y)
     r = dist.spearmanr(x, y)
-    assert_array_almost_equal(r, scipy_expected.correlation)
+    assert_array_almost_equal(r, 1 - scipy_expected.correlation)
 
 
 def test_alternative_distances():


=====================================
pynndescent/tests/test_pynndescent_.py
=====================================
@@ -287,6 +287,24 @@ def test_deduplicated_data_behaves_normally(seed, cosine_hang_data):
     ), "NN-descent did not get 95% accuracy on nearest neighbors"
 
 
+def test_rp_trees_should_not_stack_overflow_with_near_duplicate_data(seed, cosine_near_duplicates_data):
+
+    n_neighbors = 10
+    knn_indices, _ = NNDescent(
+        cosine_near_duplicates_data,
+        "cosine",
+        {},
+        n_neighbors,
+        random_state=np.random.RandomState(seed),
+        n_trees=20,
+    )._neighbor_graph
+
+    for i in range(cosine_near_duplicates_data.shape[0]):
+        assert len(knn_indices[i]) == len(
+            np.unique(knn_indices[i])
+        ), "Duplicate graph_indices in knn graph"
+
+
 def test_output_when_verbose_is_true(spatial_data, seed):
     out = io.StringIO()
     with redirect_stdout(out):
@@ -663,3 +681,8 @@ def test_tree_no_split(small_data, sparse_small_data, metric):
         ), "NN-descent query did not get 95% for accuracy on nearest neighbors on {} data".format(
             data_type
         )
+
+@pytest.mark.skipif('NUMBA_DISABLE_JIT' in os.environ, reason="Too expensive for disabled Numba")
+def test_bad_data():
+    data = np.sqrt(np.load("pynndescent/tests/test_data/pynndescent_bug_np.npz")['arr_0'])
+    index = NNDescent(data, metric="cosine")


=====================================
pynndescent/utils.py
=====================================
@@ -676,7 +676,7 @@ def apply_graph_updates_high_memory(current_graph, updates, in_graph):
     return n_changes
 
 
-@numba.njit(cache=True)
+@numba.njit(cache=False)
 def initalize_heap_from_graph_indices(heap, graph_indices, data, metric):
 
     for i in range(graph_indices.shape[0]):


=====================================
setup.py
=====================================
@@ -8,7 +8,7 @@ def readme():
 
 configuration = {
     "name": "pynndescent",
-    "version": "0.5.8",
+    "version": "0.5.11",
     "description": "Nearest Neighbor Descent",
     "long_description": readme(),
     "classifiers": [
@@ -16,7 +16,6 @@ configuration = {
         "Intended Audience :: Science/Research",
         "Intended Audience :: Developers",
         "License :: OSI Approved",
-        "Programming Language :: C",
         "Programming Language :: Python",
         "Topic :: Software Development",
         "Topic :: Scientific/Engineering",



View it on GitLab: https://salsa.debian.org/python-team/packages/python-pynndescent/-/compare/d6ce6292c883607cd6aebe8537436f5708106c48...b972a7f420e3ebcc0186eabd61dd5b4461183b63
