您好,欢迎访问三七文档
ToAppearintheIEEEComputerCHAMELEON:AHierarchicalClusteringAlgorithmUsingDynamicModelingGeorgeKarypisEui-Hong(Sam)HanVipinKumarDepartmentofComputerScienceandEngineeringUniversityofMinnesota4-192EECSBldg.,200UnionSt.SEMinneapolis,MN55455,USATechnicalReport#99–007{karypis,han,kumar}@cs.umn.eduAbstractClusteringindataminingisadiscoveryprocessthatgroupsasetofdatasuchthattheintraclustersimilarityismaximizedandtheinterclustersimilarityisminimized.Existingclusteringalgorithms,suchasK-means,PAM,CLARANS,DBSCAN,CURE,andROCKaredesignedtofindclustersthatfitsomestaticmodels.Thesealgorithmscanbreakdownifthechoiceofparametersinthestaticmodelisincorrectwithrespecttothedatasetbeingclustered,orifthemodelisnotadequatetocapturethecharacteristicsofclusters.Furthermore,mostofthesealgorithmsbreakdownwhenthedataconsistsofclustersthatareofdiverseshapes,densities,andsizes.Inthispaper,wepresentanovelhierarchicalclusteringalgorithmcalledCHAMELEONthatmeasuresthesimilarityoftwoclustersbasedonadynamicmodel.Intheclusteringprocess,twoclustersaremergedonlyiftheinter-connectivityandcloseness(proximity)betweentwoclustersarehighrelativetotheinternalinter-connectivityoftheclustersandclosenessofitemswithintheclusters.Themergingprocessusingthedynamicmodelpresentedinthispaperfacilitatesdiscoveryofnaturalandhomogeneousclusters.ThemethodologyofdynamicmodelingofclustersusedinCHAMELEONisapplicabletoalltypesofdataaslongasasimilaritymatrixcanbeconstructed.WedemonstratetheeffectivenessofCHAMELEONinanumberofdatasetsthatcontainpointsin2Dspace,andcontainclustersofdifferentshapes,densities,sizes,noise,andartifacts.ExperimentalresultsonthesedatasetsshowthatCHAMELEONcandiscovernaturalclustersthatmanyexistingstate-of-theartclusteringalgorithmsfailtofind.Keywords:Clustering,datamining,dynamicmodeling,graphpartitioning,k-nearestneighborgraph.1IntroductionClusteringindatamining[SAD+93,CHY96]isadiscoveryprocessthatgroupsasetofdatasuchthattheintraclustersimilarityismaximizedandtheinterclustersimilarityisminimized[JD88,KR90,PAS96,CHY96].Thesediscoveredclusterscanbeusedtoexplainthecharacteristicsoftheunderlyingdatadistribution,andthusserveasthefoundationforotherdataminingandanalysistechniques.Theapplicationsofclusteringincludecharacterizationofdifferentcustomergroupsbaseduponpurchasingpatterns,categorizationofdocumentsontheWorldWideWeb[BGG+99a,BGG+99b],groupingofgenesandproteinsthathavesimilarfunctionality[HHS92,NRS+95,SCC+95,HKKM98],groupingofspatiallocationspronetoearthquakesfromseismologicaldata[BR98,XEKS98],etc.Existingclusteringalgorithms,suchasK-means[JD88],PAM[KR90],CLARANS[NH94],DBSCAN[EKSX96],CURE[GRS98],andROCK[GRS99]aredesignedtofindclustersthatfitsomestaticmodels.Forexample,K-means,PAM,andCLARANSassumethatclustersarehyper-ellipsoidal(orglobular)andareofsimilarsizes.DBSCANassumesthatallpointswithingenuineclustersaredensityreachable1andpointsacrossdifferentclustersarenot.Agglomerativehierarchicalclusteringalgorithms,suchasCUREandROCKuseastaticmodeltodeterminethemostsimilarclustertomergeinthehierarchicalclustering.CUREmeasuresthesimilarityoftwoclustersbasedonthesimilarityoftheclosestpairoftherepresentativepointsbelongingtodifferentclusters,withoutconsideringtheinternalcloseness(i.e.,densityorhomogeneity)ofthetwoclustersinvolved.ROCKmeasuresthesimilarityoftwoclustersbycomparingtheaggregateinter-connectivityoftwoclustersagainstauser-specifiedstaticinter-connectivitymodel,andthusignoresthepotentialvariationsintheinter-connectivityofdifferentclusterswithinthesamedataset.Thesealgorithmscanbreakdownifthechoiceofparametersinthestaticmodelisincorrectwithrespecttothedatasetbeingclustered,orifthemodelisnotadequatetocapturethecharacteristicsofclusters.Furthermore,mostofthesealgorithmsbreakdownwhenthedataconsistsofclustersthatareofdiverseshapes,densities,andsizes.Inthispaper,wepresentanovelhierarchicalclusteringalgorithmcalledCHAMELEONthatmeasuresthesim-ilarityoftwoclustersbasedonadynamicmodel.Intheclusteringprocess,twoclustersaremergedonlyiftheinter-connectivityandcloseness(proximity)betweentwoclustersarecomparabletotheinternalinter-connectivityoftheclustersandclosenessofitemswithintheclusters.Themergingprocessusingthedynamicmodelpresentedinthispaperfacilitatesdiscoveryofnaturalandhomogeneousclusters.ThemethodologyofdynamicmodelingofclustersusedinCHAMELEONisapplicabletoalltypesofdataaslongasasimilaritymatrixcanbeconstructed.WedemonstratetheeffectivenessofCHAMELEONinanumberofdatasetsthatcontainpointsin2Dspace,andcontainclustersofdifferentshapes,densities,sizes,noise,andartifacts.Therestofthepaperisorganizedasfollows.Section2givesanoverviewofrelatedclusteringalgorithms.Section3presentsthelimitationsoftherecentlyproposedstateoftheartclusteringalgorithms.WepresentournewclusteringalgorithminSection4.Section5givestheexperimentalresults.Section6containsconclusionsanddirectionsforfuturework.2RelatedWorkInthissection,wegiveabriefdescriptionofexistingclusteringalgorithms.1Apointpisdensityreachablefromapointq,iftheyareconnectedbyachainofpointssuchthateachpointhasminimalnumberofdatapoints
本文标题:CHAMELEON A Hierarchical Clustering Algorithm Usin
链接地址:https://www.777doc.com/doc-6209990 .html