[Python-modules-commits] [python-cluster] 01/01: new release 1.3.1
Adrian Alves
alvesadrian-guest at moszumanska.debian.org
Wed Apr 20 12:19:24 UTC 2016
This is an automated email from the git hooks/post-receive script.
alvesadrian-guest pushed a commit to branch release_1.3.1
in repository python-cluster.
commit 6327860f79fdaa3838f04f6efa0fb9c41ad2efb7
Author: Adrian Alves <adrian.alves at codeenigma.com>
Date: Wed Mar 30 20:44:59 2016 -0300
new release 1.3.1
---
AUTHORS | 2 +
CHANGELOG | 10 +
INSTALL | 27 +
MANIFEST.in | 4 +-
PKG-INFO | 53 --
README => README.rst | 29 +-
cluster.py | 739 ----------------------------
cluster/__init__.py | 25 +
cluster/cluster.py | 162 ++++++
cluster/linkage.py | 100 ++++
cluster/matrix.py | 171 +++++++
cluster/method/__init__.py | 17 +
cluster/method/base.py | 69 +++
cluster/method/hierarchical.py | 208 ++++++++
cluster/method/kmeans.py | 168 +++++++
cluster/test/test_hierarchical.py | 249 ++++++++++
cluster/test/test_kmeans.py | 141 ++++++
cluster/test/test_linkage.py | 31 ++
cluster/test/test_numpy.py | 36 ++
cluster/util.py | 132 +++++
cluster/version.txt | 1 +
clusterTests.py | 190 -------
debian/changelog | 6 +-
debian/control | 4 +-
debian/watch | 2 +-
docs/Makefile | 177 +++++++
docs/apidoc/cluster.matrix.rst | 7 +
docs/apidoc/cluster.method.base.rst | 7 +
docs/apidoc/cluster.method.hierarchical.rst | 7 +
docs/apidoc/cluster.method.kmeans.rst | 7 +
docs/apidoc/cluster.rst | 7 +
docs/apidoc/cluster.util.rst | 7 +
docs/changelog.rst | 13 +
docs/conf.py | 260 ++++++++++
docs/index.rst | 112 +++++
fabfile.py | 10 +
makedist.sh | 8 +
pytest.ini | 2 +
setup.cfg | 4 +-
setup.py | 49 +-
40 files changed, 2238 insertions(+), 1015 deletions(-)
diff --git a/AUTHORS b/AUTHORS
new file mode 100644
index 0000000..5fb6b96
--- /dev/null
+++ b/AUTHORS
@@ -0,0 +1,2 @@
+Michel Albert (exhuma at users.sourceforge.net)
+Sam Sandberg (@LoisaidaSam)
\ No newline at end of file
diff --git a/CHANGELOG b/CHANGELOG
index 303c116..03ba52c 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,13 @@
+1.2.1
+ - Fixed an issue in multiprocessing code.
+
+1.2.0
+ - Multiprocessing (by loisaidasam)
+ - Python 3 support
+ - Split up one big file into smaller more logical sub-modules
+ - Fixed https://github.com/exhuma/python-cluster/issues/11
+ - Documentation update.
+
1.1.1b3
- Fixed bug #1727558
- Some more unit-tests
diff --git a/INSTALL b/INSTALL
new file mode 100644
index 0000000..060219b
--- /dev/null
+++ b/INSTALL
@@ -0,0 +1,27 @@
+INSTALLATION
+============
+
+Simply run::
+
+ pip install cluster
+
+Or, if you run it in a virtualenv:
+
+ /path/to/your/env/bin/pip install cluster
+
+
+Source installation
+~~~~~~~~~~~~~~~~~~~
+
+Untar the archive::
+
+ tar xf <filename.tar.gz>
+
+Next, go to the folder just created. It will have the same name as the package
+(for example "cluster-1.2.2") and run::
+
+ python setup.py install
+
+This will require superuser privileges unless you install it in a virtual environment::
+
+ /path/to/your/env/bin/python setup.py install
diff --git a/MANIFEST.in b/MANIFEST.in
index c8d686f..2f813f4 100644
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,2 +1,2 @@
-include README LICENSE CHANGELOG
-include *.py cluster.bmp MANIFEST.in
+include README.rst LICENSE CHANGELOG
+include cluster.bmp
diff --git a/PKG-INFO b/PKG-INFO
deleted file mode 100644
index 04865b2..0000000
--- a/PKG-INFO
+++ /dev/null
@@ -1,53 +0,0 @@
-Metadata-Version: 1.0
-Name: cluster
-Version: 1.1.1b3
-Summary: python-cluster is a "simple" package that allows to create several groups
-(clusters) of objects from a list
-Home-page: http://python-cluster.sourceforge.net/
-Author: Michel Albert
-Author-email: exhuma at users.sourceforge.net
-License: LGPL
-Description: DESCRIPTION
- ===========
-
- python-cluster is a "simple" package that allows to create several groups
- (clusters) of objects from a list. It's meant to be flexible and able to
- cluster any object. To ensure this kind of flexibility, you need not only to
- supply the list of objects, but also a function that calculates the similarity
- between two of those objects. For simple datatypes, like integers, this can be
- as simple as a subtraction, but more complex calculations are possible. Right
- now, it is possible to generate the clusters using a hierarchical clustering
- and the popular K-Means algorithm. For the hierarchical algorithm there are
- different "linkage" (single, complete, average and uclus) methods available. I
- plan to implement other algoithms as well on an
- "as-needed" or "as-I-have-time" basis.
-
- Algorithms are based on the document found at
- http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/
-
- USAGE
- =====
-
- A simple python program could look like this::
-
- >>> from cluster import *
- >>> data = [12,34,23,32,46,96,13]
- >>> cl = HierarchicalClustering(data, lambda x,y: abs(x-y))
- >>> cl.getlevel(10) # get clusters of items closer than 10
- [96, 46, [12, 13, 23, 34, 32]]
- >>> cl.getlevel(5) # get clusters of items closer than 5
- [96, 46, [12, 13], 23, [34, 32]]
-
- Note, that when you retrieve a set of clusters, it immediately starts the
- clustering process, which is quite complex. If you intend to create clusters
- from a large dataset, consider doing that in a separate thread.
-
- For K-Means clustering it would look like this:
-
- >>> from cluster import KMeansClustering
- >>> cl = KMeansClustering([(1,1), (2,1), (5,3), ...])
- >>> clusters = cl.getclusters(2)
-
- The parameter passed to getclusters is the count of clusters generated.
-
-Platform: UNKNOWN
diff --git a/README b/README.rst
similarity index 64%
rename from README
rename to README.rst
index 16d2218..f2e7c93 100644
--- a/README
+++ b/README.rst
@@ -1,6 +1,10 @@
DESCRIPTION
===========
+.. image:: https://readthedocs.org/projects/python-cluster/badge/?version=latest
+ :target: http://python-cluster.readthedocs.org
+ :alt: Documentation Status
+
python-cluster is a "simple" package that allows to create several groups
(clusters) of objects from a list. It's meant to be flexible and able to
cluster any object. To ensure this kind of flexibility, you need not only to
@@ -9,19 +13,23 @@ between two of those objects. For simple datatypes, like integers, this can be
as simple as a subtraction, but more complex calculations are possible. Right
now, it is possible to generate the clusters using a hierarchical clustering
and the popular K-Means algorithm. For the hierarchical algorithm there are
-different "linkage" (single, complete, average and uclus) methods available. I
-plan to implement other algoithms as well on an
-"as-needed" or "as-I-have-time" basis.
+different "linkage" (single, complete, average and uclus) methods available.
Algorithms are based on the document found at
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/
+.. note::
+ The above site is no longer avaialble, but you can still view it in the
+ internet archive at:
+ https://web.archive.org/web/20070912040206/http://home.dei.polimi.it//matteucc/Clustering/tutorial_html/
+
+
USAGE
=====
A simple python program could look like this::
- >>> from cluster import *
+ >>> from cluster import HierarchicalClustering
>>> data = [12,34,23,32,46,96,13]
>>> cl = HierarchicalClustering(data, lambda x,y: abs(x-y))
>>> cl.getlevel(10) # get clusters of items closer than 10
@@ -33,10 +41,15 @@ Note, that when you retrieve a set of clusters, it immediately starts the
clustering process, which is quite complex. If you intend to create clusters
from a large dataset, consider doing that in a separate thread.
-For K-Means clustering it would look like this:
+For K-Means clustering it would look like this::
- >>> from cluster import KMeansClustering
- >>> cl = KMeansClustering([(1,1), (2,1), (5,3), ...])
- >>> clusters = cl.getclusters(2)
+ >>> from cluster import KMeansClustering
+ >>> cl = KMeansClustering([(1,1), (2,1), (5,3), ...])
+ >>> clusters = cl.getclusters(2)
The parameter passed to getclusters is the count of clusters generated.
+
+
+.. image:: https://readthedocs.org/projects/python-cluster/badge/?version=latest
+ :target: http://python-cluster.readthedocs.org
+ :alt: Documentation Status
diff --git a/cluster.py b/cluster.py
deleted file mode 100644
index a3ec51f..0000000
--- a/cluster.py
+++ /dev/null
@@ -1,739 +0,0 @@
-#
-# This is part of "python-cluster". A library to group similar items together.
-# Copyright (C) 2006 Michel Albert
-#
-# This library is free software; you can redistribute it and/or modify it under
-# the terms of the GNU Lesser General Public License as published by the Free
-# Software Foundation; either version 2.1 of the License, or (at your option)
-# any later version.
-# This library is distributed in the hope that it will be useful, but WITHOUT
-# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
-# FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more
-# details.
-# You should have received a copy of the GNU Lesser General Public License
-# along with this library; if not, write to the Free Software Foundation, Inc.,
-# 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
-#
-
-from types import TupleType
-
-class ClusteringError(Exception):
- pass
-
-def flatten(L):
- """
- Flattens a list.
- Example:
- flatten([a,b,[c,d,[e,f]]]) = [a,b,c,d,e,f]
- """
- if type(L) != type([]): return [L]
- if L == []: return L
- return flatten(L[0]) + flatten(L[1:])
-
-def median(numbers):
- """Return the median of the list of numbers.
-
- found at: http://mail.python.org/pipermail/python-list/2004-December/253517.html"""
- # Sort the list and take the middle element.
- n = len(numbers)
- copy = numbers[:] # So that "numbers" keeps its original order
- copy.sort()
- if n & 1: # There is an odd number of elements
- return copy[n // 2]
- else:
- return (copy[n // 2 - 1] + copy[n // 2]) / 2.0
-
-def mean(numbers):
- """Returns the arithmetic mean of a numeric list.
-
- found at: http://mail.python.org/pipermail/python-list/2004-December/253517.html"""
- return float(sum(numbers)) / float(len(numbers))
-
-def minkowski_distance(x, y, p=2):
- """
- Calculates the minkowski distance between two points.
-
- PARAMETERS
- x - the first point
- y - the second point
- p - the order of the minkowski algorithm.
- Default = 2. This is equal to the euclidian distance.
- If the order is 1, it is equal to the manhatten
- distance.
- The higher the order, the closer it converges to the
- Chebyshev distance, which has p=infinity
- """
- from math import pow
- assert(len(y)==len(x))
- assert(x>=1)
- sum = 0
- for i in range(len(x)):
- sum += abs(x[i]-y[i]) ** p
- return pow(sum, 1.0/float(p))
-
-def genmatrix(list, combinfunc, symmetric=False, diagonal=None):
- """
- Takes a list and generates a 2D-matrix using the supplied combination
- function to calculate the values.
-
- PARAMETERS
- list - the list of items
- combinfunc - the function that is used to calculate teh value in a cell.
- It has to cope with two arguments.
- symmetric - Whether it will be a symmetric matrix along the diagonal.
- For example, it the list contains integers, and the
- combination function is abs(x-y), then the matrix will be
- symmetric.
- Default: False
- diagonal - The value to be put into the diagonal. For some functions,
- the diagonal will stay constant. An example could be the
- function "x-y". Then each diagonal cell will be "0".
- If this value is set to None, then the diagonal will be
- calculated.
- Default: None
- """
- matrix = []
- row_index = 0
- for item in list:
- row = []
- col_index = 0
- for item2 in list:
- if diagonal is not None and col_index == row_index:
- # if this is a cell on the diagonal
- row.append(diagonal)
- elif symmetric and col_index < row_index:
- # if the matrix is symmetric and we are "in the lower left triangle"
- row.append( matrix[col_index][row_index] )
- else:
- # if this cell is not on the diagonal
- row.append(combinfunc(item, item2))
- col_index += 1
- matrix.append(row)
- row_index += 1
- return matrix
-
-def printmatrix(list):
- """
- Prints out a 2-dimensional list cleanly.
- This is useful for debugging.
-
- PARAMETERS
- list - the 2D-list to display
- """
- # determine maximum length
- maxlen = 0
- colcount = len(list[0])
- for col in list:
- for cell in col:
- maxlen = max(len(str(cell)), maxlen)
- # print data
- format = " %%%is |" % maxlen
- format = "|" + format*colcount
- for row in list:
- print format % tuple(row)
-
-def magnitude(a):
- "calculates the magnitude of a vecor"
- from math import sqrt
- sum = 0
- for coord in a:
- sum += coord ** 2
- return sqrt(sum)
-
-def dotproduct(a, b):
- "Calculates the dotproduct between two vecors"
- assert(len(a) == len(b))
- out = 0
- for i in range(len(a)):
- out += a[i]*b[i]
- return out
-
-def centroid(list, method=median):
- "returns the central vector of a list of vectors"
- out = []
- for i in range(len(list[0])):
- out.append( method( [x[i] for x in list] ) )
- return tuple(out)
-
-class Cluster:
- """
- A collection of items. This is internally used to detect clustered items in
- the data so we could distinguish other collection types (lists, dicts, ...)
- from the actual clusters. This means that you could also create clusters of
- lists with this class.
- """
-
- def __str__(self):
- return "<Cluster@%s(%s)>" % (self.__level, self.__items)
-
- def __repr__(self):
- return self.__str__()
-
- def __init__(self, level, *args):
- """
- Constructor
-
- PARAMETERS
- level - The level of this cluster. This is used in hierarchical
- clustering to retrieve a specific set of clusters. The higher
- the level, the smaller the count of clusters returned. The
- level depends on the difference function used.
- *args - every additional argument passed following the level value
- will get added as item to the cluster. You could also pass a
- list as second parameter to initialise the cluster with that
- list as content
- """
- self.__level = level
- if len(args) == 0: self.__items = []
- else: self.__items = list(args)
-
- def append(self, item):
- """
- Appends a new item to the cluster
-
- PARAMETERS
- item - The item that is to be appended
- """
- self.__items.append(item)
-
- def items(self, newItems = None):
- """
- Sets or gets the items of the cluster
-
- PARAMETERS
- newItems (optional) - if set, the items of the cluster will be
- replaced with that argument.
- """
- if newItems is None: return self.__items
- else: self.__items = newItems
-
- def fullyflatten(self, *args):
- """
- Completely flattens out this cluster and returns a one-dimensional list
- containing the cluster's items. This is useful in cases where some items
- of the cluster are clusters in their own right and you only want the
- items.
-
- PARAMETERS
- *args - only used for recursion.
- """
- flattened_items = []
- if len(args) == 0: collection = self.__items
- else: collection = args[0].items()
-
- for item in collection:
- if isinstance(item, Cluster):
- flattened_items = flattened_items + self.fullyflatten(item)
- else:
- flattened_items.append(item)
-
- return flattened_items
-
- def level(self):
- """
- Returns the level associated with this cluster
- """
- return self.__level
-
- def display(self, depth=0):
- """
- Pretty-prints this cluster. Useful for debuging
- """
- print depth*" " + "[level %s]" % self.__level
- for item in self.__items:
- if isinstance(item, Cluster):
- item.display(depth+1)
- else:
- print depth*" "+"%s" % item
-
- def topology(self):
- """
- Returns the structure (topology) of the cluster as tuples.
-
- Output from cl.data:
- [<Cluster at 0.833333333333(['CVS', <Cluster at 0.818181818182(['34.xls',
- <Cluster at 0.789473684211([<Cluster at 0.555555555556(['0.txt',
- <Cluster at 0.181818181818(['ChangeLog', 'ChangeLog.txt'])>])>,
- <Cluster at 0.684210526316(['20060730.py',
- <Cluster at 0.684210526316(['.cvsignore',
- <Cluster at 0.647058823529(['About.py',
- <Cluster at 0.625(['.idlerc', '.pylint.d'])>])>])>])>])>])>])>]
-
- Corresponding output from cl.topo():
- ('CVS', ('34.xls', (('0.txt', ('ChangeLog', 'ChangeLog.txt')),
- ('20060730.py', ('.cvsignore', ('About.py',
- ('.idlerc', '.pylint.d')))))))
- """
-
- left = self.__items[0]
- right = self.__items[1]
- if isinstance(left, Cluster):
- first = left.topology()
- else:
- first = left
- if isinstance(right, Cluster):
- second = right.topology()
- else:
- second = right
- return first, second
-
- def getlevel(self, threshold):
- """
- Retrieve all clusters up to a specific level threshold. This
- level-threshold represents the maximum distance between two clusters. So
- the lower you set this threshold, the more clusters you will receive and
- the higher you set it, you will receive less but bigger clusters.
-
- PARAMETERS
- threshold - The level threshold
-
- NOTE
- It is debatable whether the value passed into this method should
- really be as strongly linked to the real cluster-levels as it is right
- now. The end-user will not know the range of this value unless s/he
- first inspects the top-level cluster. So instead you might argue that
- a value ranging from 0 to 1 might be a more useful approach.
- """
-
- left = self.__items[0]
- right = self.__items[1]
-
- # if this object itself is below the threshold value we only need to
- # return it's contents as a list
- if self.level() <= threshold:
- return [self.fullyflatten()]
-
- # if this cluster's level is higher than the threshold we will investgate
- # it's left and right part. Their level could be below the threshold
- if isinstance(left, Cluster) and left.level() <= threshold:
- if isinstance(right, Cluster):
- return [left.fullyflatten()] + right.getlevel(threshold)
- else:
- return [left.fullyflatten()] + [[right]]
- elif isinstance(right, Cluster) and right.level() <= threshold:
- if isinstance(left, Cluster):
- return left.getlevel(threshold) + [right.fullyflatten()]
- else:
- return [[left]] + [right.fullyflatten()]
-
- # Alright. We covered the cases where one of the clusters was below the
- # threshold value. Now we'll deal with the clusters that are above by
- # recursively applying the previous cases.
- if isinstance(left, Cluster) and isinstance(right, Cluster):
- return left.getlevel(threshold) + right.getlevel(threshold)
- elif isinstance(left, Cluster):
- return left.getlevel(threshold) + [[right]]
- elif isinstance(right, Cluster):
- return [[left]] + right.getlevel(threshold)
- else:
- return [[left], [right]]
-
-class BaseClusterMethod:
- """
- The base class of all clustering methods.
- """
-
- def __init__(self, input, distance_function):
- """
- Constructs the object and starts clustering
-
- PARAMETERS
- input - a list of objects
- distance_function - a function returning the distance - or opposite of
- similarity ( distance = -similarity ) - of two
- items from the input. In other words, the closer
- the two items are related, the smaller this value
- needs to be. With 0 meaning they are exactly the
- same.
-
- NOTES
- The distance function should always return the absolute distance
- between two given items of the list. Say,
-
- distance(input[1], input[4]) = distance(input[4], input[1])
-
- This is very important for the clustering algorithm to work!
- Naturally, the data returned by the distance function MUST be a
- comparable datatype, so you can perform arithmetic comparisons on
- them (< or >)! The simplest examples would be floats or ints. But as
- long as they are comparable, it's ok.
- """
- self.distance = distance_function
- self._input = input # the original input
- self._data = input[:] # clone the input so we can work with it
-
- def topo(self):
- """
- Returns the structure (topology) of the cluster.
-
- See Cluster.topology() for information.
- """
- return self.data[0].topology()
-
- def __get_data(self):
- """
- Returns the data that is currently in process.
- """
- return self._data
- data = property(__get_data)
-
- def __get_raw_data(self):
- """
- Returns the raw data (data without being clustered).
- """
- return self._input
- raw_data = property(__get_raw_data)
-
-class HierarchicalClustering(BaseClusterMethod):
- """
- Implementation of the hierarchical clustering method as explained in
- http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html
-
- USAGE
- >>> from cluster import HierarchicalClustering
- >>> # or: from cluster import *
- >>> cl = HierarchicalClustering([123,334,345,242,234,1,3], lambda x,y: float(abs(x-y)))
- >>> cl.getlevel(90)
- [[345, 334], [234, 242], [123], [3, 1]]
-
- Note that all of the returned clusters are more that 90 apart
-
- """
-
- def __init__(self, data, distance_function, linkage='single'):
- """
- Constructor
-
- See BaseClusterMethod.__init__ for more details.
- """
- BaseClusterMethod.__init__(self, data, distance_function)
-
- # set the linkage type to single
- self.setLinkageMethod(linkage)
- self.__clusterCreated = False
-
- def setLinkageMethod(self, method):
- """
- Sets the method to determine the distance between two clusters.
-
- PARAMETERS:
- method - The name of the method to use. It must be one of 'single',
- 'complete', 'average' or 'uclus'
- """
- if method == 'single':
- self.linkage = self.singleLinkageDistance
- elif method == 'complete':
- self.linkage = self.completeLinkageDistance
- elif method == 'average':
- self.linkage = self.averageLinkageDistance
- elif method == 'uclus':
- self.linkage = self.uclusDistance
- else:
- raise ValueError, 'distance method must be one of single, complete, average of uclus'
-
- def uclusDistance(self, x, y):
- """
- The method to determine the distance between one cluster an another
- item/cluster. The distance equals to the *average* (median) distance from
- any member of one cluster to any member of the other cluster.
-
- PARAMETERS
- x - first cluster/item
- y - second cluster/item
- """
- # create a flat list of all the items in <x>
- if not isinstance(x, Cluster): x = [x]
- else: x = x.fullyflatten()
-
- # create a flat list of all the items in <y>
- if not isinstance(y, Cluster): y = [y]
- else: y = y.fullyflatten()
-
- distances = []
- for k in x:
- for l in y:
- distances.append(self.distance(k,l))
- return median(distances)
-
- def averageLinkageDistance(self, x, y):
- """
- The method to determine the distance between one cluster an another
- item/cluster. The distance equals to the *average* (mean) distance from
- any member of one cluster to any member of the other cluster.
-
- PARAMETERS
- x - first cluster/item
- y - second cluster/item
- """
- # create a flat list of all the items in <x>
- if not isinstance(x, Cluster): x = [x]
- else: x = x.fullyflatten()
-
- # create a flat list of all the items in <y>
- if not isinstance(y, Cluster): y = [y]
- else: y = y.fullyflatten()
-
- distances = []
- for k in x:
- for l in y:
- distances.append(self.distance(k,l))
- return mean(distances)
-
- def completeLinkageDistance(self, x, y):
- """
- The method to determine the distance between one cluster an another
- item/cluster. The distance equals to the *longest* distance from any
- member of one cluster to any member of the other cluster.
-
- PARAMETERS
- x - first cluster/item
- y - second cluster/item
- """
-
- # create a flat list of all the items in <x>
- if not isinstance(x, Cluster): x = [x]
- else: x = x.fullyflatten()
-
- # create a flat list of all the items in <y>
- if not isinstance(y, Cluster): y = [y]
- else: y = y.fullyflatten()
-
- # retrieve the minimum distance (single-linkage)
- maxdist = self.distance(x[0], y[0])
- for k in x:
- for l in y:
- maxdist = max(maxdist, self.distance(k,l))
-
- return maxdist
-
- def singleLinkageDistance(self, x, y):
- """
- The method to determine the distance between one cluster an another
- item/cluster. The distance equals to the *shortest* distance from any
- member of one cluster to any member of the other cluster.
-
- PARAMETERS
- x - first cluster/item
- y - second cluster/item
- """
-
- # create a flat list of all the items in <x>
- if not isinstance(x, Cluster): x = [x]
- else: x = x.fullyflatten()
-
- # create a flat list of all the items in <y>
- if not isinstance(y, Cluster): y = [y]
- else: y = y.fullyflatten()
-
- # retrieve the minimum distance (single-linkage)
- mindist = self.distance(x[0], y[0])
- for k in x:
- for l in y:
- mindist = min(mindist, self.distance(k,l))
-
- return mindist
-
- def cluster(self, matrix=None, level=None, sequence=None):
- """
- Perform hierarchical clustering. This method is automatically called by
- the constructor so you should not need to call it explicitly.
-
- PARAMETERS
- matrix - The 2D list that is currently under processing. The matrix
- contains the distances of each item with each other
- level - The current level of clustering
- sequence - The sequence number of the clustering
- """
-
- if matrix is None:
- # create level 0, first iteration (sequence)
- level = 0
- sequence = 0
- matrix = []
-
- # if the matrix only has two rows left, we are done
- while len(matrix) > 2 or matrix == []:
-
- matrix = genmatrix(self._data, self.linkage, True, 0)
-
- smallestpair = None
- mindistance = None
- rowindex = 0 # keep track of where we are in the matrix
- # find the minimum distance
- for row in matrix:
- cellindex = 0 # keep track of where we are in the matrix
- for cell in row:
- # if we are not on the diagonal (which is always 0)
- # and if this cell represents a new minimum...
- if (rowindex != cellindex) and ( cell < mindistance or smallestpair is None ):
- smallestpair = ( rowindex, cellindex )
- mindistance = cell
- cellindex += 1
- rowindex += 1
-
- sequence += 1
- level = matrix[smallestpair[1]][smallestpair[0]]
- cluster = Cluster(level, self._data[smallestpair[0]], self._data[smallestpair[1]])
-
- # maintain the data, by combining the the two most similar items in the list
- # we use the min and max functions to ensure the integrity of the data.
- # imagine: if we first remove the item with the smaller index, all the
- # rest of the items shift down by one. So the next index will be
- # wrong. We could simply adjust the value of the second "remove" call,
- # but we don't know the order in which they come. The max and min
- # approach clarifies that
- self._data.remove(self._data[max(smallestpair[0], smallestpair[1])]) # remove item 1
- self._data.remove(self._data[min(smallestpair[0], smallestpair[1])]) # remove item 2
- self._data.append(cluster) # append item 1 and 2 combined
-
- # all the data is in one single cluster. We return that and stop
- self.__clusterCreated = True
- return
-
- def getlevel(self, threshold):
- """
- Returns all clusters with a maximum distance of <threshold> in between
- each other
-
- PARAMETERS
- threshold - the maximum distance between clusters
-
- SEE-ALSO
- Cluster.getlevel(threshold)
- """
-
- # if it's not worth clustering, just return the data
- if len(self._input) <= 1: return self._input
-
- # initialize the cluster if not yet done
- if not self.__clusterCreated: self.cluster()
-
- return self._data[0].getlevel(threshold)
-
-class KMeansClustering:
- """
- Implementation of the kmeans clustering method as explained in
- http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html
-
- USAGE
- =====
-
- >>> from cluster import KMeansClustering
- >>> cl = KMeansClustering([(1,1), (2,1), (5,3), ...])
- >>> clusters = cl.getclusters(2)
- """
-
- def __init__(self, data, distance=None):
- """
- Constructor
-
- PARAMETERS
- data - A list of tuples or integers.
- distance - A function determining the distance between two items.
- Default: It assumes the tuples contain numeric values and
- appiles a generalised form of the
- euclidian-distance algorithm on them.
- """
- self.__data = data
- self.distance = distance
- self.__initial_length = len(data)
-
- # test if each item is of same dimensions
- if len(data) > 1 and isinstance(data[0], TupleType):
- control_length = len(data[0])
- for item in data[1:]:
- if len(item) != control_length:
- raise ValueError("Each item in the data list must have the same amount of dimensions. Item", item, "was out of line!")
- # now check if we need and have a distance function
- if len(data) > 1 and not isinstance(data[0], TupleType) and distance is None:
- raise ValueError("You supplied non-standard items but no distance function! We cannot continue!")
- # we now know that we have tuples, and assume therefore that it's items are numeric
- elif distance is None:
- self.distance = minkowski_distance
-
- def getclusters(self, n):
- """
- Generates <n> clusters
-
- PARAMETERS
- n - The amount of clusters that should be generated.
- n must be greater than 1
- """
-
- # only proceed if we got sensible input
- if n <= 1:
- raise ClusteringError("When clustering, you need to ask for at least two clusters! You asked for %d" % n)
-
- # return the data straight away if there is nothing to cluster
- if self.__data == [] or len(self.__data) == 1 or n == self.__initial_length:
- return self.__data
-
- # It makes no sense to ask for more clusters than data-items available
- if n > self.__initial_length:
- raise ClusteringError( """Unable to generate more clusters than items
-available. You supplied %d items, and asked for %d clusters.""" %
- (self.__initial_length, n) )
-
- self.initialiseClusters(self.__data, n)
-
- items_moved = True # tells us if any item moved between the clusters,
- # as we initialised the clusters, we assume that
- # is the case
- while items_moved is True:
- items_moved = False
- for cluster in self.__clusters:
- for item in cluster:
- res = self.assign_item(item, cluster)
- if items_moved is False: items_moved = res
- return self.__clusters
-
- def assign_item(self, item, origin):
- """
- Assigns an item from a given cluster to the closest located cluster
-
- PARAMETERS
- item - the item to be moved
- origin - the originating cluster
- """
- closest_cluster = origin
- for cluster in self.__clusters:
- if self.distance(item, centroid(cluster)) < self.distance(item, centroid(closest_cluster)):
- closest_cluster = cluster
-
- if closest_cluster != origin:
- self.move_item(item, origin, closest_cluster)
- return True
- else:
- return False
-
- def move_item(self, item, origin, destination):
- """
- Moves an item from one cluster to anoter cluster
-
- PARAMETERS
-
- item - the item to be moved
- origin - the originating cluster
- destination - the target cluster
- """
- destination.append( origin.pop( origin.index(item) ) )
-
- def initialiseClusters(self, input, clustercount):
- """
- Initialises the clusters by distributing the items from the data evenly
- across n clusters
-
- PARAMETERS
- input - the data set (a list of tuples)
- clustercount - the amount of clusters (n)
- """
- # initialise the clusters with empty lists
- self.__clusters = []
- for x in xrange(clustercount): self.__clusters.append([])
-
- # distribute the items into the clusters
- count = 0
- for item in input:
- self.__clusters[ count % clustercount ].append(item)
- count += 1
-
diff --git a/cluster/__init__.py b/cluster/__init__.py
new file mode 100644
index 0000000..49567fa
--- /dev/null
+++ b/cluster/__init__.py
@@ -0,0 +1,25 @@
+#
+# This is part of "python-cluster". A library to group similar items together.
+# Copyright (C) 2006 Michel Albert
+#
+# This library is free software; you can redistribute it and/or modify it
+# under the terms of the GNU Lesser General Public License as published by the
+# Free Software Foundation; either version 2.1 of the License, or (at your
... 2585 lines suppressed ...
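
Note on the removed cluster.py shown above: its minkowski_distance() runs
assert(x>=1), which compares the point x with 1. The docstring describes p as
the order, so p was presumably the intended operand. On Python 3, comparing a
list or tuple with an int raises TypeError. The block below is a minimal sketch
of the same calculation with that check applied to p. It is an illustration
only, not the code shipped in cluster/util.py (that part of the diff is
suppressed in this mail). Raising ValueError instead of asserting is an
assumption made here.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two equal-length points.

    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    if p < 1:
        # The removed code asserted x >= 1 here; the order p is checked instead.
        raise ValueError("order p must be >= 1")
    total = sum(abs(a - b) ** p for a, b in zip(x, y))
    return total ** (1.0 / p)


if __name__ == "__main__":
    print(minkowski_distance((0, 0), (3, 4)))        # Euclidean distance
    print(minkowski_distance((0, 0), (3, 4), p=1))   # Manhattan distance
```

The function accepts tuples or lists, so it could serve as the default
distance for tuple data in the way the removed KMeansClustering constructor
used minkowski_distance.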
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/python-modules/packages/python-cluster.git