
To Appear in the IEEE Computer

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

George Karypis, Eui-Hong (Sam) Han, Vipin Kumar
Department of Computer Science and Engineering
University of Minnesota
4-192 EECS Bldg., 200 Union St. SE
Minneapolis, MN 55455, USA

Technical Report #99–007

{karypis, han, kumar}@cs.umn.edu

Abstract

Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. Existing clustering algorithms, such as K-means, PAM, CLARANS, DBSCAN, CURE, and ROCK are designed to find clusters that fit some static models. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of clusters. Furthermore, most of these algorithms break down when the data consists of clusters that are of diverse shapes, densities, and sizes. In this paper, we present a novel hierarchical clustering algorithm called CHAMELEON that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between two clusters are high relative to the internal inter-connectivity of the clusters and closeness of items within the clusters. The merging process using the dynamic model presented in this paper facilitates discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in CHAMELEON is applicable to all types of data as long as a similarity matrix can be constructed. We demonstrate the effectiveness of CHAMELEON in a number of data sets that contain points in 2D space, and contain clusters of different shapes, densities, sizes, noise, and artifacts. Experimental results on these data sets show that CHAMELEON can discover natural clusters that many existing state-of-the-art clustering algorithms fail to find.

Keywords: Clustering, data mining, dynamic modeling, graph partitioning, k-nearest neighbor graph.

1 Introduction

Clustering in data mining [SAD+93, CHY96] is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized [JD88, KR90, PAS96, CHY96]. These discovered clusters can be used to explain the characteristics of the underlying data distribution, and thus serve as the foundation for other data mining and analysis techniques. The applications of clustering include characterization of different customer groups based upon purchasing patterns, categorization of documents on the World Wide Web [BGG+99a, BGG+99b], grouping of genes and proteins that have similar functionality [HHS92, NRS+95, SCC+95, HKKM98], grouping of spatial locations prone to earthquakes from seismological data [BR98, XEKS98], etc.

Existing clustering algorithms, such as K-means [JD88], PAM [KR90], CLARANS [NH94], DBSCAN [EKSX96], CURE [GRS98], and ROCK [GRS99] are designed to find clusters that fit some static models. For example, K-means, PAM, and CLARANS assume that clusters are hyper-ellipsoidal (or globular) and are of similar sizes. DBSCAN assumes that all points within genuine clusters are density reachable¹ and points across different clusters are not. Agglomerative hierarchical clustering algorithms, such as CURE and ROCK, use a static model to determine the most similar cluster to merge in the hierarchical clustering. CURE measures the similarity of two clusters based on the similarity of the closest pair of the representative points belonging to different clusters, without considering the internal closeness (i.e., density or homogeneity) of the two clusters involved. ROCK measures the similarity of two clusters by comparing the aggregate inter-connectivity of two clusters against a user-specified static inter-connectivity model, and thus ignores the potential variations in the inter-connectivity of different clusters within the same data set. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of clusters. Furthermore, most of these algorithms break down when the data consists of clusters that are of diverse shapes, densities, and sizes.

In this paper, we present a novel hierarchical clustering algorithm called CHAMELEON that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between two clusters are comparable to the internal inter-connectivity of the clusters and closeness of items within the clusters. The merging process using the dynamic model presented in this paper facilitates discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in CHAMELEON is applicable to all types of data as long as a similarity matrix can be constructed. We demonstrate the effectiveness of CHAMELEON in a number of data sets that contain points in 2D space, and contain clusters of different shapes, densities, sizes, noise, and artifacts.
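
As an informal illustration of the merging rule described above, the following Python sketch compares the connectivity and closeness between two clusters with the same quantities measured inside each cluster. It is only a sketch under stated assumptions, not the algorithm's definition: this part of the paper does not give the formal measures (they are presented in Section 4), so the two ratios below are simplified stand-ins that use the total and the mean similarity of the edges inside each cluster as the "internal" quantities. The function names, the thresholds `min_inter` and `min_close`, and the toy similarity matrix are all hypothetical.

```python
from itertools import combinations


def cross_weights(sim, a, b):
    """Positive similarities on edges joining a point of `a` to a point of `b`."""
    return [sim[i][j] for i in a for j in b if sim[i][j] > 0]


def internal_weights(sim, c):
    """Positive similarities on edges joining two points of the same cluster."""
    return [sim[i][j] for i, j in combinations(c, 2) if sim[i][j] > 0]


def mean(ws):
    return sum(ws) / len(ws)


def relative_scores(sim, a, b):
    """Return (relative inter-connectivity, relative closeness) of a and b.

    ASSUMPTION: both ratios are simplified stand-ins. Internal connectivity is
    taken as the total internal edge weight and internal closeness as the mean
    internal edge weight of each cluster.
    """
    cross = cross_weights(sim, a, b)
    in_a, in_b = internal_weights(sim, a), internal_weights(sim, b)
    if not cross or not in_a or not in_b:
        # No connecting edges, or a cluster with no internal edges
        # (e.g. a singleton): this sketch simply refuses to score the pair.
        return 0.0, 0.0
    rel_inter = sum(cross) / ((sum(in_a) + sum(in_b)) / 2)
    rel_close = mean(cross) / ((mean(in_a) + mean(in_b)) / 2)
    return rel_inter, rel_close


def should_merge(sim, a, b, min_inter=0.5, min_close=0.5):
    """Merge only if BOTH relative scores reach the (hypothetical) thresholds."""
    rel_inter, rel_close = relative_scores(sim, a, b)
    return rel_inter >= min_inter and rel_close >= min_close


def toy_similarity():
    """Six points: 0-3 form one dense group, 4-5 a looser one, 3-4 a weak bridge."""
    n = 6
    sim = [[0.0] * n for _ in range(n)]

    def link(i, j, w):
        sim[i][j] = sim[j][i] = w

    for i, j in combinations(range(4), 2):
        link(i, j, 1.0)
    link(4, 5, 0.5)
    link(3, 4, 0.125)
    return sim


sim = toy_similarity()
print(should_merge(sim, [0, 1], [2, 3]))  # True: two halves of the dense group
print(should_merge(sim, [2, 3], [4, 5]))  # False: joined only by the weak bridge
```

In this toy example the two halves of the dense group are connected as strongly to each other as they are internally, so they qualify for merging, whereas the weak bridge between points 3 and 4 is small relative to the internal similarity of both clusters it joins. Because the thresholds are applied to ratios rather than to absolute similarities, the same rule adapts to clusters of different densities, which is the intent of the dynamic model.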

The rest of the paper is organized as follows. Section 2 gives an overview of related clustering algorithms. Section 3 presents the limitations of the recently proposed state-of-the-art clustering algorithms. We present our new clustering algorithm in Section 4. Section 5 gives the experimental results. Section 6 contains conclusions and directions for future work.

2 Related Work

In this section, we give a brief description of existing clustering algorithms.

¹ A point p is density reachable from a point q, if they are connected by a chain of points such that each point has minimal number of data points