Clustering

Overview
  Partitioning Methods
    • K-Means
    • Sequential Leader
  Model Based Methods
  Density Based Methods
  Hierarchical Methods

What is cluster analysis?
  Finding groups of objects
    • Objects similar to each other are in the same group.
    • Objects are different from those in other groups.
  Unsupervised Learning
    • No labels
    • Data driven

Clusters
  [Figure: clusters, with labels "Inter-Cluster" and "Intra-Cluster"]

Applications of Clustering
  Marketing
    • Finding groups of customers with similar behaviours.
  Biology
    • Finding groups of animals or plants with similar features.
  Bioinformatics
    • Clustering microarray data, genes and sequences.
  Earthquake Studies
    • Clustering observed earthquake epicenters to identify dangerous zones.
  Clustering weblog data to discover groups of similar access patterns.
  Social Networks
    • Discovering groups of individuals with close friendships internally.

Earthquakes
  [Figure]

Image Segmentation
  [Figure]

The Big Picture
  [Figure]

Requirements
  Scalability
  Ability to deal with different types of attributes
  Ability to discover clusters with arbitrary shape
  Minimum requirements for domain knowledge
  Ability to deal with noise and outliers
  Insensitivity to order of input records
  Incorporation of user-defined constraints
  Interpretability and usability

Practical Considerations
  [Figure: scatter plots of four points (1-4) on X-Y axes running from 0 to 6, shown at the original scale and after rescaling one axis (X1 = 0.2X, Y1 = 0.2Y)]
  Scaling matters!
  Normalization or Not?

Evaluation
  $m_i = \frac{1}{n_i} \sum_{x \in D_i} x$
  $J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2$
  [Figure: comparison labelled "VS."]

Silhouette
  A method of interpretation and validation of clusters of data.
  A succinct graphical representation of how well each data point lies within its cluster compared to other clusters.
  a(i): average dissimilarity of i with all other points in the same cluster
  b(i): the lowest average dissimilarity of i to other clusters
  $s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$
  [Figure: scatter plot of two clusters (both axes from -3 to 4) and the matching silhouette plot; axes: Silhouette Value (-0.2 to 1), Cluster (1, 2)]

K-Means
  Determine the value of K.
  Choose K cluster centres randomly.
  Each data point is assigned to its closest centroid.
  Use the mean of each cluster to update each centroid.
  Repeat until no more new assignment.
  Return the K centroids.
  Reference
    J. MacQueen (1967): Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297.

Comments on K-Means
  Pros
    Simple and works well for regular disjoint clusters.
    Converges relatively fast.
    Relatively efficient and scalable: O(t·k·n)
      • t: iteration; k: number of centroids; n: number of data points
  Cons
    Need to specify the value of K in advance.
      • Difficult, and domain knowledge may help.
    May converge to local optima.
      • In practice, try different initial centroids.
    May be sensitive to noisy data and outliers.
      • Mean of data points …
    Not suitable for clusters of
      • Non-convex shapes

The Influence of Initial Centroids
  [Figure]

Sequential Leader Clustering
  A very efficient clustering algorithm.
    • No iteration
    • A single pass of the data
  No need to specify K in advance.
  Choose a cluster threshold value.
  For every new data point:
    • Compute the distance between the new data point and every cluster's centre.
    • If the minimum distance is smaller than the chosen threshold, assign the new data point to the corresponding cluster and re-compute the cluster centre.
    • Otherwise, create a new cluster with the new data point as its centre.
  Clustering results may be influenced by the sequence of data points.

Gaussian Mixture
  $g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / (2\sigma^2)}$
  $f(x) = \sum_{i=1}^{n} \alpha_i\, g(x, \mu_i, \sigma_i), \quad \alpha_i \ge 0 \;\&\; \sum_{i=1}^{n} \alpha_i = 1$

Clustering by Mixture Models

K-Means Revisited
  model parameters: $\theta = \{(x_1, y_1), (x_2, y_2)\}$
  latent parameters: $Z = \{\text{Cluster}_1, \text{Cluster}_2\}$

Expectation Maximization
  $P(A \mid E) = \frac{P(E \mid A)\, P(A)}{P(E)}$

EM: Gaussian Mixture
  m: the number of data points
  n: the number of mixture components
  $z_{ij}$: whether instance i is generated by the jth Gaussian
  $E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{k=1}^{n} p(x = x_i \mid \mu = \mu_k)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{k=1}^{n} e^{-\frac{1}{2\sigma^2}(x_i - \mu_k)^2}}$
  $\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$
  $\alpha_j \leftarrow \frac{1}{m} \sum_{i=1}^{m} E[z_{ij}]$

Density Based Methods
  Generate clusters of arbitrary shapes.
  Robust against noise.
  No K value required in advance.
  Somewhat similar to human vision.

DBSCAN
  Density-Based Spatial Clustering of Applications with Noise
  Density: number of points within a specified radius
  Core Point: points with high density
  Border Point: points with low density but in the neighbourhood of a core point
  Noise Point: neither a core point nor a border point
  [Figure: Core Point, Noise Point, Border Point]
  [Figure: three panels; p, q directly density reachable; p, q density reachable; o, q, p density connected]
  A cluster is defined as the maximal set of density connected points.
  Start from a randomly selected unseen point P.
  If P is a core point, build a cluster by gradually adding all points that are density reachable to the current point set.
  Noise points are discarded (unlabelled).

Hierarchical Clustering
  Produce a set of nested tree-like clusters.
    • Can be visualized as a dendrogram.
  Clustering is obtained by cutting at desired level.
  No need to specify K in advance.
  May correspond to meaningful taxonomies.

Agglomerative Methods
  Bottom-up Method
    • Assign each data point to a cluster.
    • Calculate the proximity matrix.
    • Merge the pair of closest clusters.
    • Repeat until only a single cluster remains.
  How to calculate the distance between clusters?
    Single Link
      • Minimum distance between points
    Complete Link
      • Maximum distance between points

Example

        BA    FI    MI    NA    RM    TO
  BA     0   662   877   255   412   996
  FI   662     0   295   468   268   400
  MI   877   295     0   754   564   138
  NA   255   468   754     0   219   869
  RM   412   268   564   219     0   669
  TO   996   400   138   869   669     0

Single Link Example

           BA    FI  MI/TO    NA    RM
  BA        0   662    877   255   412
  FI      662     0    295   468   268
  MI/TO   877
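The silhouette definition in the slides, s(i) = (b(i) - a(i)) / max{a(i), b(i)}, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `euclidean` and `silhouette_values` are hypothetical, Euclidean distance is assumed as the dissimilarity, and a point that is alone in its cluster is given the value 0 by convention (the slides do not cover that case).

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Euclidean distance between two points given as tuples (assumed dissimilarity)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_values(points, labels):
    """Return s(i) for every point, following s(i) = (b - a) / max(a, b)."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumption: a singleton cluster gets s(i) = 0.
            values.append(0.0)
            continue
        # a(i): average dissimilarity to the other points of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b(i): lowest average dissimilarity to any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values
```

Values close to 1 indicate a point that sits well inside its own cluster, values near 0 a point on the boundary, and negative values a point that is on average closer to another cluster.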
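The K-Means steps listed in the slides (choose K centres, assign each point to its closest centroid, update each centroid with the cluster mean, repeat until no assignment changes) can be sketched as follows. The names `k_means`, `squared_distance`, `init`, `seed` and `max_iter` are illustrative assumptions; the optional `init` argument is there only so that different initial centroids can be tried, as the slides recommend.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, init=None, max_iter=100, seed=0):
    """Plain K-Means on a list of tuples. Returns (centroids, assignment)."""
    if init is not None:
        centroids = [tuple(c) for c in init]
    else:
        # Choose K cluster centres randomly among the data points.
        centroids = random.Random(seed).sample(points, k)

    assignment = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update each centroid with the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment
```

Because the result depends on the starting centres, a common practice is to call `k_means` with several seeds and keep the run with the lowest sum of squared errors.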
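Sequential Leader Clustering, as described in the slides, makes a single pass over the data and compares each new point with the current cluster centres. A minimal sketch follows, assuming Euclidean distance and a running mean as the re-computed centre; the names `sequential_leader` and `euclidean` are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sequential_leader(points, threshold):
    """Single-pass clustering. Returns (centres, labels) in input order."""
    centres = []  # current centre of every cluster
    members = []  # points assigned to every cluster
    labels = []
    for p in points:
        best = None
        if centres:
            # Distance between the new point and every cluster's centre.
            dists = [euclidean(p, c) for c in centres]
            best = min(range(len(centres)), key=dists.__getitem__)
        if best is not None and dists[best] < threshold:
            # Close enough: join the cluster and re-compute its centre.
            members[best].append(p)
            n = len(members[best])
            centres[best] = tuple(sum(c) / n for c in zip(*members[best]))
        else:
            # Otherwise the point starts a new cluster as its centre.
            centres.append(tuple(p))
            members.append([p])
            best = len(centres) - 1
        labels.append(best)
    return centres, labels
```

Feeding the same three points in a different order shows the order sensitivity mentioned in the slides: with a threshold of 2.5, the sequence (0), (2), (4) groups 0 with 2, while the sequence (2), (4), (0) groups 2 with 4.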
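The EM updates on the "EM: Gaussian Mixture" slide alternate between computing the expected memberships E[z_ij] and re-estimating each mean as a weighted average. The sketch below follows those formulas for one-dimensional data under the same simplification the slide uses, namely a single known standard deviation shared by all components. The function name `em_gaussian_means` and the parameters `sigma` and `n_iter` are assumptions made for illustration.

```python
import math


def em_gaussian_means(xs, initial_means, sigma=1.0, n_iter=50):
    """EM for a 1-D Gaussian mixture with a shared, fixed sigma.

    Returns (means, weights); weights[j] is the average of E[z_ij] over the data.
    """
    means = list(initial_means)
    n_components = len(means)
    resp = []
    for _ in range(n_iter):
        # E-step: E[z_ij] is proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
        resp = []
        for x in xs:
            w = [math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) for mu in means]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: each mean becomes the E[z_ij]-weighted average of the data.
        means = [
            sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
            for j in range(n_components)
        ]
    # Mixing weights: (1/m) * sum_i E[z_ij], taken from the last E-step.
    weights = [sum(r[j] for r in resp) / len(xs) for j in range(n_components)]
    return means, weights
```

This sketch does not guard against numerical underflow, so it assumes every data point lies within a few `sigma` of at least one current mean; a production implementation would work with log-probabilities.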
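The DBSCAN description in the slides (core, border and noise points; clusters grown from core points through density reachability) can be illustrated with a short brute-force sketch. The parameter names `eps` (the radius) and `min_pts` (the density needed to be a core point) are assumptions, as is the choice to count a point as its own neighbour; noise points are returned with the label `None`, matching "discarded (unlabelled)".

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN. Returns one label per point; None marks noise."""
    n = len(points)
    # Density: the points within radius eps (the point itself included).
    neighbours = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbours]

    labels = [None] * n
    cluster = 0
    for i in range(n):
        # Only an unlabelled core point can start a new cluster.
        if labels[i] is not None or not core[i]:
            continue
        labels[i] = cluster
        frontier = [i]  # core points whose neighbourhoods still need expanding
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] is None:
                    labels[q] = cluster  # q is density reachable from the seed
                    if core[q]:
                        frontier.append(q)  # border points are not expanded
        cluster += 1
    return labels
```

The neighbour search here is quadratic in the number of points, which is fine for a demonstration; practical implementations use a spatial index instead.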
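The single-link example can be reproduced from the six-city distance matrix given in the slides. The sketch below finds the closest pair, merges it, and rebuilds the proximity matrix using the minimum distance between points; the function names `closest_pair`, `merge_single_link` and `agglomerate` are hypothetical, and the "A/B" naming of merged clusters simply mirrors the "MI/TO" label in the slides.

```python
CITIES = ["BA", "FI", "MI", "NA", "RM", "TO"]
DIST = [
    [0, 662, 877, 255, 412, 996],
    [662, 0, 295, 468, 268, 400],
    [877, 295, 0, 754, 564, 138],
    [255, 468, 754, 0, 219, 869],
    [412, 268, 564, 219, 0, 669],
    [996, 400, 138, 869, 669, 0],
]


def closest_pair(dist):
    """Return indices (i, j), i < j, of the two closest clusters."""
    n = len(dist)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(pairs, key=lambda ij: dist[ij[0]][ij[1]])


def merge_single_link(labels, dist, i, j):
    """Merge clusters i and j (i < j); the merged cluster takes position i."""
    keep = [k for k in range(len(labels)) if k != j]
    new_labels = [labels[i] + "/" + labels[j] if k == i else labels[k] for k in keep]

    def d(a, b):
        if a == b:
            return 0
        rows = (i, j) if a == i else (a,)
        cols = (i, j) if b == i else (b,)
        # Single link: minimum distance between points of the two clusters.
        return min(dist[r][c] for r in rows for c in cols)

    return new_labels, [[d(a, b) for b in keep] for a in keep]


def agglomerate(labels, dist):
    """Repeat merging until one cluster remains; return the merge history."""
    history = []
    while len(labels) > 1:
        i, j = closest_pair(dist)
        history.append((labels[i], labels[j], dist[i][j]))
        labels, dist = merge_single_link(labels, dist, i, j)
    return history


if __name__ == "__main__":
    for left, right, height in agglomerate(CITIES, DIST):
        print(left, "+", right, "at distance", height)
```

The first merge joins MI and TO at distance 138, and the BA and FI rows of the rebuilt matrix match the partial table shown in the slides. Swapping `min` for `max` inside `d` would give the complete-link variant.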
Document title: Cluster Analysis in Data Mining
Link: https://www.777doc.com/doc-7243254.html