您好,欢迎访问三七文档
AbstractClusteringalgorithmsareattractiveforthetaskofclassiden-tificationinspatialdatabases.However,theapplicationtolargespatialdatabasesrisesthefollowingrequirementsforclusteringalgorithms:minimalrequirementsofdomainknowledgetodeterminetheinputparameters,discoveryofclusterswitharbitraryshapeandgoodefficiencyonlargeda-tabases.Thewell-knownclusteringalgorithmsoffernosolu-tiontothecombinationoftheserequirements.Inthispaper,wepresentthenewclusteringalgorithmDBSCANrelyingonadensity-basednotionofclusterswhichisdesignedtodis-coverclustersofarbitraryshape.DBSCANrequiresonlyoneinputparameterandsupportstheuserindetermininganap-propriatevalueforit.Weperformedanexperimentalevalua-tionoftheeffectivenessandefficiencyofDBSCANusingsyntheticdataandrealdataoftheSEQUOIA2000bench-mark.Theresultsofourexperimentsdemonstratethat(1)DBSCANissignificantlymoreeffectiveindiscoveringclus-tersofarbitraryshapethanthewell-knownalgorithmCLAR-ANS,andthat(2)DBSCANoutperformsCLARANSbyafactorofmorethan100intermsofefficiency.Keywords:ClusteringAlgorithms,ArbitraryShapeofClus-ters,EfficiencyonLargeSpatialDatabases,HandlingNoise.1.IntroductionNumerousapplicationsrequirethemanagementofspatialdata,i.e.datarelatedtospace.SpatialDatabaseSystems(SDBS)(Gueting1994)aredatabasesystemsfortheman-agementofspatialdata.Increasinglylargeamountsofdataareobtainedfromsatelliteimages,X-raycrystallographyorotherautomaticequipment.Therefore,automatedknow-ledgediscoverybecomesmoreandmoreimportantinspatialdatabases.Severaltasksofknowledgediscoveryindatabases(KDD)havebeendefinedintheliterature(Matheus,Chan&Pi-atetsky-Shapiro1993).Thetaskconsideredinthispaperisclassidentification,i.e.thegroupingoftheobjectsofadata-baseintomeaningfulsubclasses.Inanearthobservationda-tabase,e.g.,wemightwanttodiscoverclassesofhousesalongsomeriver.Clusteringalgorithmsareattractiveforthetaskofclassidentification.However,theapplicationtolargespatialdata-basesrisesthefollowingrequirementsforclusteringalgo-rithms:(1)Minimalrequirementsofdomainknowledgetodeter-minetheinputparameters,becauseappropriatevaluesareoftennotknowninadvancewhendealingwithlargedatabases.(2)Discoveryofclusterswitharbitraryshape,becausetheshapeofclustersinspatialdatabasesmaybespherical,drawn-out,linear,elongatedetc.(3)Goodefficiencyonlargedatabases,i.e.ondatabasesofsignificantlymorethanjustafewthousandobjects.Thewell-knownclusteringalgorithmsoffernosolutiontothecombinationoftheserequirements.Inthispaper,wepresentthenewclusteringalgorithmDBSCAN.Itrequiresonlyoneinputparameterandsupportstheuserindetermin-inganappropriatevalueforit.Itdiscoversclustersofarbi-traryshape.Finally,DBSCANisefficientevenforlargespa-tialdatabases.Therestofthepaperisorganizedasfollows.Wediscussclusteringalgorithmsinsection2evaluatingthemaccordingtotheaboverequirements.Insection3,wepresentournotionofclusterswhichisbasedontheconceptofdensityinthedatabase.Section4introducesthealgo-rithmDBSCANwhichdiscoverssuchclustersinaspatialdatabase.Insection5,weperformedanexperimentalevalu-ationoftheeffectivenessandefficiencyofDBSCANusingsyntheticdataanddataoftheSEQUOIA2000benchmark.Section6concludeswithasummaryandsomedirectionsforfutureresearch.2.ClusteringAlgorithmsTherearetwobasictypesofclusteringalgorithms(Kaufman&Rousseeuw1990):partitioningandhierarchicalalgo-rithms.Partitioningalgorithmsconstructapartitionofada-tabaseDofnobjectsintoasetofkclusters.kisaninputpa-rameterforthesealgorithms,i.esomedomainknowledgeisrequiredwhichunfortunatelyisnotavailableformanyap-plications.ThepartitioningalgorithmtypicallystartswithaninitialpartitionofDandthenusesaniterativecontrolstrategytooptimizeanobjectivefunction.Eachclusterisrepresentedbythegravitycenterofthecluster(k-meansal-gorithms)orbyoneoftheobjectsoftheclusterlocatednearitscenter(k-medoidalgorithms).Consequently,partitioningalgorithmsuseatwo-stepprocedure.First,determinekrep-resentativesminimizingtheobjectivefunction.Second,as-signeachobjecttotheclusterwithitsrepresentative“clos-est”totheconsideredobject.Thesecondstepimpliesthatapartitionisequivalenttoavoronoidiagramandeachclusteriscontainedinoneofthevoronoicells.Thus,theshapeofallADensity-BasedAlgorithmforDiscoveringClustersinLargeSpatialDatabaseswithNoiseMartinEster,Hans-PeterKriegel,JörgSander,XiaoweiXuInstituteforComputerScience,UniversityofMunichOettingenstr.67,D-80538München,Germany{ester|kriegel|sander|xwxu}@informatik.uni-muenchen.dePublishedinProceedingsof2ndInternationalConferenceonKnowledgeDiscoveryandDataMining(KDD-96)clustersfoundbyapartitioningalgorithmisconvexwhichisveryrestrictive.Ng&Han(1994)explorepartitioningalgorithmsforKDDinspatialdatabases.AnalgorithmcalledCLARANS(ClusteringLargeApplicationsbasedonRANdomizedSearch)isintroducedwhichisanimprovedk-medoidmeth-od.Comparedtoformerk-medoidalgorithms,CLARANSismoreeffectiveandmoreefficient.Anexperimentalevalu-ationindicatesthatCLARANSrunsefficientlyondatabasesofthousandsofobjects.Ng&Han(1994)alsodiscussmeth-odstodeterminethe“natural”numberknatofclustersinadatabase.TheyproposetorunCLARANSonceforeachkfrom2ton.Foreachofthedi
本文标题:a-density-based-algorithm-for-discovering-clusters
链接地址:https://www.777doc.com/doc-5404054 .html