fastcluster: Fast hierarchical clustering routines for R and Python

Copyright © 2011 Daniel Müllner
<http://danifold.net>

The fastcluster package is a C++ library for hierarchical, agglomerative
clustering. It efficiently implements the seven most widely used clustering
schemes: single, complete, average, weighted/McQuitty, Ward, centroid and
median linkage. The library currently has interfaces to two languages: R and
Python/NumPy. Part of the functionality is designed as a drop-in replacement
for existing routines: “linkage” in the SciPy package “scipy.cluster.hierarchy”,
“hclust” in R's “stats” package, and the “flashClust” package. Once the
fastcluster library is loaded at the beginning of the code, every program that
uses hierarchical clustering can benefit immediately and effortlessly from the
performance gain. Moreover, there are memory-saving routines for clustering of
vector data, which go beyond what the existing packages provide.

See the author's home page <http://danifold.net> for more
information, in particular a performance comparison with other clustering
packages. The User's manual is the file inst/doc/fastcluster.pdf in the
source distribution.

The fastcluster package is distributed under the BSD license. See the file
LICENSE in the source distribution or
<http://opensource.org/licenses/BSD-2-Clause>.


Installation
────────────
See the file INSTALL in the source distribution.


Usage
─────
1. R
────
In R, load the package with the following command:

    library('fastcluster')

The package overwrites the function hclust from the “stats” package (in the
same way as the flashClust package does).
Please remove any references to the
flashClust package from your R files so that the hclust function is not
accidentally overwritten with the flashClust version.

The new hclust function has exactly the same calling conventions as the old
one. You may just load the package and immediately and effortlessly enjoy the
performance improvements. The function is also an improvement over the
flashClust function from the “flashClust” package. Just replace every call to
flashClust by hclust and expect your code to work as before, only faster. (If
you are using flashClust prior to version 1.01, update it! See the change log
for flashClust:

    http://cran.r-project.org/web/packages/flashClust/ChangeLog )

If you need to access the old function or make sure that the right function is
called, specify the package as follows:

    fastcluster::hclust(…)
    flashClust::hclust(…)
    stats::hclust(…)

Vector data can be clustered with a memory-saving algorithm with the command

    hclust.vector(…)

See the User's manual inst/doc/fastcluster.pdf for further details.

WARNING
───────
R and Matlab/SciPy use different conventions for the “Ward”, “centroid” and
“median” methods. R assumes that the dissimilarity matrix consists of squared
Euclidean distances, while Matlab and SciPy expect non-squared Euclidean
distances. The fastcluster package respects these conventions and uses
different formulas in the two interfaces.

If you want the same results in both interfaces, then feed the hclust function
in R with the entry-wise square of the distance matrix, D^2, for the “Ward”,
“centroid” and “median” methods and later take the square root of the height
field in the dendrogram. For the “average” and “weighted”
alias “mcquitty”
methods, you must still take the same distance matrix D as in the Python
interface for the same results. The “single” and “complete” methods only depend
on the relative order of the distances, hence it does not make a difference
whether the method operates on the distances or the squared distances.

The code example in the R documentation (enter ?hclust or example(hclust) in R)
contains an instance where the squared distance matrix is generated from
Euclidean data.

2. Python
─────────
The fastcluster package is imported as usual by

    import fastcluster

It provides the following functions:

    linkage(X, method='single', metric='euclidean', preserve_input=True)
    single(X)
    complete(X)
    average(X)
    weighted(X)
    ward(X)
    centroid(X)
    median(X)
    linkage_vector(X, method='single', metric='euclidean', extraarg=None)

The argument X is either a compressed distance matrix or a collection of n
observation vectors in d dimensions as an (n×d) array. Apart from the argument
preserve_input, the methods have the same input and output as the functions of
the same name in the package scipy.cluster.hierarchy.

The additional, optional argument preserve_input specifies whether the
fastcluster package first copies the distance matrix or writes into the
existing array. If the dissimilarities are generated for the clustering step
only and are not needed afterward, approximately half the memory can be saved
by specifying preserve_input=False. Note that the input array X contains
unspecified values after this procedure.
You may want to write

    linkage(X, method='…', preserve_input=False)
    del X

to make sure that the matrix X is not accessed accidentally after it has been
used as scratch memory.

The method

    linkage_vector(X, method='single', metric='euclidean', extraarg=None)

provides memory-saving clustering for vector data. Like linkage, it accepts a
collection of n observation vectors in d dimensions as an (n×d) array as the
first parameter. The parameter 'method' is either 'single', 'ward', 'centroid'
or 'median'. The 'ward', 'centroid' and 'median' methods require the Euclidean
metric. In case of single linkage, the 'metric' parameter can be chosen from
all metrics which are implemented in scipy.spatial.distance.pdist. There may be
differences between

    linkage(scipy.spatial.distance.pdist(X, metric='…'))
and
    linkage_vector(X, metric='…')

since a few corrections have been made compared to the pdist function.
Please consult the User's manual inst/doc/fastcluster.pdf for
comprehensive details.
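
The two input forms mentioned above (a compressed distance matrix versus an
(n×d) array of observation vectors) can be illustrated with a small sketch in
plain Python. It is not part of fastcluster and does not show how the library
works internally. The helper names condensed_distances and condensed_index and
the toy data are hypothetical. The sketch assumes the pair ordering used by
SciPy's pdist, namely (0,1), (0,2), …, (n−2,n−1).

```python
import math
from itertools import combinations


def condensed_distances(points):
    """Euclidean distances for all pairs i < j, in the order (0,1), (0,2), ..."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def condensed_index(n, i, j):
    """Position of the pair (i, j), i != j, in a compressed matrix of n points."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i  # the matrix is symmetric, only i < j is stored
    return n * i - i * (i + 1) // 2 + (j - i - 1)


# Hypothetical toy data: n = 4 observations in d = 2 dimensions.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
n, d = len(points), len(points[0])

D = condensed_distances(points)

# The compressed form stores n*(n-1)/2 numbers, the vector form n*d numbers.
print("compressed entries:", len(D))
print("vector entries:", n * d)
print("distance between points 0 and 1:", D[condensed_index(n, 0, 1)])
```

The compressed form grows quadratically in n, while the vector data grows only
linearly. For n = 10000 observations in d = 10 dimensions this is 49995000
dissimilarities against 100000 coordinates, which is the kind of saving that
working on vector data directly makes possible.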