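Large-Scale Content Computing in the Network Environment: Web Search and Web Mining
Cheng Xueqi (程学旗), cxq@ict.ac.cn
Institute of Computing Technology, Chinese Academy of Sciences
August 17, 2006, SWCL 2006, Shenyang

Outline
Background and Motivation
Some existing work at ICT
  – Some existing research work
  – Some sharable systems
Conclusion

A Big Problem!
Setting the debates aside: how Web search is developing
Unified Browsing, Unified Search, Personalized Search, Personalized Space

Web Mining: Knowledge Discovery in a Massive Information Space
Object: large-scale, complex network information
  – 10 billion web pages; petabyte-scale email traffic every day; more than 1 billion instant-messaging users, with hundreds of millions online at the same time; nearly 300 billion mobile text messages per year, nearly 1 billion per day on average
Goal: accurate, timely, and effective knowledge discovery
  – Business intelligence: definite judgments from incomplete information
  – Spam filtering: a cat-and-mouse game
  – Supervision of financial and securities information
  – Counter-terrorism and detection of cybercrime
Challenge: discovery is hard!
  – How to find the information you want most, rather than a pile of junk?
  – How to discover and track the most valuable leads?
  – How to respond to data streams in real time?
  – How to detect anomalies?

"Web 2.0": What Lies Behind the Buzz?
Changes in behavior patterns
  – Architecture: from server-centered to peer-distributed
  – Interactive participation: P2P, blogs
Changes in state characteristics
  – Streaming: from INFORMATION to MESSAGE
  – Socialization
Changes in content representation: Rich Content
  – Multiple sources, large scale
  – Rich dimensions

Motivation: "The problem is still the same problem, but the goal is no longer the same goal"
Model representation and feature acquisition: "A single feature space is neither complete nor orthogonal"
  – Assumptions of VSM, PM, LM, etc.
  – How to represent rich-dimensional feature spaces?
Computability: "Unified ranking is not what most users need; personalization is"
  – Unified RANKING has so many biases!
  – Identity vs. otherness (active computing)
  – Special algorithms for rich-dimensional feature spaces
Streaming message vs. text/sentence; dynamic, "context"-sensitive
  – Trade-off between deep understanding and performance
  – Shallow and efficient language processing

Outline
Background and Motivation
Some existing work at ICT
  – Some existing research work
  – Some sharable systems
Conclusion

Organization of ICT
Organization chart labels (chart layout lost in extraction):
  – ICT: DoIS; Frontier Research Center; Network and Pervasive Computing; System Architecture; Intelligent Software; Bioinformatics; Intelligent Information Processing; Grid and Service Computing; Information Intelligence & Info Security
  – I3S: IR, Web Mining, Shallow Language Processing, Data Stream Management, P2P Computing; Network Security, DRM, and Trust Computing
About 80 people in I3S
  – About 25 research faculty
  – More than 40 students: over 20 Ph.D. candidates, over 15 master's candidates

Related Work in I3S@ICT
Research topics
  – Large-scale text analysis and web mining: Dr. Xu Hongbo (许洪波) et al.
  – Deterministic (shallow) natural language processing: Dr. Zhang Huaping (张华平) et al.
  – Web search: Dr. Wang Bin (王斌), Zhang Gang (张刚), et al.
  – Large-scale feature matching and data stream mining: Dr. Tan Jianlong (谭建龙) et al.
  – Network structure mining and social computing: Dr. Chen Haiqiang (陈海强) et al.
  – P2P computing: Dr. Lü Jianming (吕建明) et al.
Sharable systems
  – Chinese word segmentation and lexical analysis software: ICTCLAS
  – High-performance full-text indexing and retrieval platform: FirteX

Section: Data Streams

Data Stream Management
Conditions:
  – High-speed streaming (over 10 GBps)
  – Large-scale queries (over 100,000)
  – Emergence of temporal unknown patterns
Requirements
  – Online response
  – Emergence prediction
Challenges
  $S = \{s_1, s_2, \ldots, s_m\}$
  $D = \{d_1, d_2, d_3, \ldots\}$
  $S' = \{\, s_j \mid j \le n,\ s_j(d_i) = \mathrm{true} \,\}$

What We Are Pursuing
Query processing
  – Processing multiple filtering queries on a single stream
  – Join algorithms on multiple streams
Data stream mining
  – Frequent pattern discovery
  – Clustering
  – Emergence prediction
  – …

Multiple-String Matching
Classic algorithms:
  – Prefix-based approach: KMP, AC, Shift-And, Shift-Or
  – Suffix-based approach: Boyer-Moore, Wu-Manber
  – Factor-based approach: SBDM, SBOM
Challenge
  – The number of feature strings increases with the rapid growth of information scale (Clam AntiVirus library: 26,653).
  – Traditional string-matching algorithms cannot solve the problem once the number of features exceeds 5,000.
[Figure: growth of network traffic on the national backbone network]
[Figure: string-matching speed as a function of the number of feature strings]
  $\mathrm{CPI} = \text{memory\_accesses\_per\_instruction} \times [\, c_1 + p_1 c_2 + p_1 p_2 c_3 \,]$
ICT:
  – Improve the time complexity of the algorithm
  – Optimize the data structures of the algorithm
Core of the problem: time optimization and space optimization

Partition: Combinatorial Optimization Matching (ICT-COM)
Finding the optimal partition = finding the shortest path in a weighted graph
Construct a weighted graph G from the given keyword set P as follows:
  – Node $N_i$: the block of keywords with length i in P, with node set $V = \{N_{m(P)}, N_{m(P)+1}, \ldots, N_{M(P)}, N_{M(P)+1}\}$
  – Edge $(N_i, N_j)$: the set of blocks with length greater than or equal to i but less than j
  – Weight $W(N_i, N_j)$: the minimal time taken by the classical algorithms to search a training text for the keywords in the corresponding subset
Objective:
  – Find the shortest path from source to sink in G

Results of ICT-COM
Optimization analysis
  – COM produced 4 subsets and assigned an algorithm to each: lengths 3-9 (AC), 10-13 (SBOM), 14-35 (SBOM), 36-210 (SBOM).
  – COM is about 3 times faster than the quickest classical algorithm.
  – ICT-COM is an efficient large-scale string-matching algorithm.
[Figure: statistics of keywords in ClamAV signatures; x-axis: keyword length (bytes); y-axis: number of keywords]
[Figure: algorithm speed comparison on ClamAV data; x-axis: algorithm (WuManber, BOM, AC, COM_D, COM_S); y-axis: scanning/matching speed (MB/s)]
LIU Ping, et al., A Partition-Based Efficient Algorithm for Large Scale Multiple-Strings Matching, IEEE SPIRE 2005

Section: Language Processing

Lexical Processing
Difficulties in Chinese lexical analysis
  – Segmentation: overlapping ambiguities, combination ambiguities
  – Unknown word recognition: named entities (PER, LOC, ORG, etc.), new words
  – POS tagging

HHMM-Based Chinese Lexical Analysis
HHMM architecture in ICTCLAS III
[Figure: HHMM architecture, trace. Labels recoverable from the diagram:
  – Processing stages: atom segmentation; NSP-based rough segmentation; simple unknown word recognition; complex unknown word recognition; class-based final segmentation; POS tagging
  – Intermediate data: string; atom sequence; top-n sequence; word graph; revised N results; words sequence; POS sequence; lexical results
  – HMM levels: 5th HMM, 4th HMM, 3rd HMM, 2nd HMM, 1st HMM
  – Entity types: PER, LOC, ORG]

Class-Based Segmentation
Word class definition:
  $c_i = \begin{cases} w_i & \text{iff } w_i \text{ is listed in the segmentation lexicon} \\ \text{PER, LOC, ORG, TIME or NUM} & \text{iff } w_i \text{ is an unknown named entity} \\ \text{STR} & \text{iff } w_i \text{ is an unknown symbol string} \\ \text{BEG} & \text{iff beginning of a sentence} \\ \text{END} & \text{iff end of a sentence} \\ \text{OTHER} & \text{otherwise} \end{cases}$
Class-based segmentation model

Role-Based Unknown Word Recognition
Unknown word recognition: role-based HMM
  – Example: 毛/Surname 泽/Mid_name 东/last_name 1893年/context 诞生/remote_context (毛泽东 = Mao Zedong; 1893年 = "in 1893"; 诞生 = "was born")
  – The probability $P(W_i \mid C_i)$ of a recognized unknown word can be estimated in the role-based HMM
Huaping Zhang, et al., Chinese Named Entity Recognition Using Role Model, International Journal of Computational Linguistics and Chinese Language Processing, 2003, Vol. 8(2)

Chinese New Word Identification
Unknown words or new words blast wit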