您好,欢迎访问三七文档
DataMiningClusterAnalysis:AdvancedConceptsandAlgorithmsLectureNotesforChapter9IntroductiontoDataMiningbyTan,Steinbach,Kumar©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20041©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20042HierarchicalClustering:RevisitedzCreatesnestedclusterszAgglomerativeclusteringalgorithmsvaryintermsofhowtheproximityoftwoclustersarecomputedMIN(singlelink):susceptibletonoise/outliersMAX/GROUPAVERAGE:maynotworkwellwithnon-globularclusters–CUREalgorithmtriestohandlebothproblemszOftenstartswithaproximitymatrix–Atypeofgraph-basedalgorithm©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20043zUsesanumberofpointstorepresentaclusterzRepresentativepointsarefoundbyselectingaconstantnumberofpointsfromaclusterandthen“shrinking”themtowardthecenteroftheclusterzClustersimilarityisthesimilarityoftheclosestpairofrepresentativepointsfromdifferentclustersCURE:AnotherHierarchicalApproach×שTan,Steinbach,KumarIntroductiontoDataMining4/18/20044CUREzShrinkingrepresentativepointstowardthecenterhelpsavoidproblemswithnoiseandoutlierszCUREisbetterabletohandleclustersofarbitraryshapesandsizes©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20045ExperimentalResults:CUREPicturefromCURE,Guha,Rastogi,Shim.©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20046ExperimentalResults:CUREPicturefromCURE,Guha,Rastogi,Shim.(centroid)(singlelink)©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20047CURECannotHandleDifferingDensitiesOriginalPointsCURE©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20048Graph-BasedClusteringzGraph-Basedclusteringusestheproximitygraph–Startwiththeproximitymatrix–Considereachpointasanodeinagraph–Eachedgebetweentwonodeshasaweightwhichistheproximitybetweenthetwopoints–Initiallytheproximitygraphisfullyconnected–MIN(single-link)andMAX(complete-link)canbeviewedasstartingwiththisgraphzInthesimplestcase,clustersareconnectedcomponentsinthegraph.©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20049Graph-BasedClustering:SparsificationzTheamountofdatathatneedstobeprocessedisdrasticallyreduced–Sparsificationcaneliminatemorethan99%oftheentriesinaproximitymatrix–Theamountoftimerequiredtoclusterthedataisdrasticallyreduced–Thesizeoftheproblemsthatcanbehandledisincreased©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200410Graph-BasedClustering:Sparsification…zClusteringmayworkbetter–Sparsificationtechniqueskeeptheconnectionstothemostsimilar(nearest)neighborsofapointwhilebreakingtheconnectionstolesssimilarpoints.–Thenearestneighborsofapointtendtobelongtothesameclassasthepointitself.–Thisreducestheimpactofnoiseandoutliersandsharpensthedistinctionbetweenclusters.zSparsificationfacilitatestheuseofgraphpartitioningalgorithms(oralgorithmsbasedongraphpartitioningalgorithms.–ChameleonandHypergraph-basedClustering©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200411SparsificationintheClusteringProcess©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200412LimitationsofCurrentMergingSchemeszExistingmergingschemesinhierarchicalclusteringalgorithmsarestaticinnature–MINorCURE:mergetwoclustersbasedontheircloseness(orminimumdistance)–GROUP-AVERAGE:mergetwoclustersbasedontheiraverageconnectivity©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200413LimitationsofCurrentMergingSchemesClosenessschemeswillmerge(a)and(b)(a)(b)(c)(d)Averageconnectivityschemeswillmerge(c)and(d)©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200414Chameleon:ClusteringUsingDynamicModelingzAdapttothecharacteristicsofthedatasettofindthenaturalclusterszUseadynamicmodeltomeasurethesimilaritybetweenclusters–Mainpropertyistherelativeclosenessandrelativeinter-connectivityofthecluster–Twoclustersarecombinediftheresultingclustersharescertainpropertieswiththeconstituentclusters–Themergingschemepreservesself-similarityzOneoftheareasofapplicationisspatialdata©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200415CharacteristicsofSpatialDataSets•Clustersaredefinedasdenselypopulatedregionsofthespace•Clustershavearbitraryshapes,orientation,andnon-uniformsizes•Differenceindensitiesacrossclustersandvariationindensitywithinclusters•Existenceofspecialartifacts(streaks)andnoiseTheclusteringalgorithmmustaddresstheabovecharacteristicsandalsorequireminimalsupervision.©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200416Chameleon:StepszPreprocessingStep:RepresenttheDatabyaGraph–Givenasetofpoints,constructthek-nearest-neighbor(k-NN)graphtocapturetherelationshipbetweenapointanditsknearestneighbors–Conceptofneighborhoodiscaptureddynamically(evenifregionissparse)zPhase1:Useamultilevelgraphpartitioningalgorithmonthegraphtofindalargenumberofclustersofwell-connectedvertices–Eachclustershouldcontainmostlypointsfromone“true”cluster,i.e.,isasub-clusterofa“real”cluster©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200417Chameleon:Steps…zPhase2:UseHierarchicalAgglomerativeClusteringtomergesub-clusters–Twoclustersarecombinediftheresultingclustersharescertainpropertieswiththeconstituentclusters–Twokeypropertiesusedtomodelclustersimilarity:RelativeInterconnectivity:Absol
本文标题:Introduction to data mining-lecture notes chap09_a
链接地址:https://www.777doc.com/doc-6131179 .html