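Research on a Hub Web Page Identification Method Using URL Features*

Corresponding author ORCID: 0000-0001-6640-4460, E-mail: smiling_boy@163.com
* [...] project No. 61171159.
Affiliations: 1 ([...] TRS [...], 100085); 2 ([...], 100101)
XIANDAI TUSHU QINGBAO JISHU, 2016, No. 1 (General No. 266)

Abstract: [...] URL [...] URL [...] support vector machine (SVM) [...] 91.2% [...] 60% [...] URL [...]
Keywords: URL [...] Hub [...]
Classification number: TP391.1 G35

1 [...]

[...] Web [...] 8.52% [1]. [...] URL [...] Web [...] Web pages fall into two types, Hub pages and Topic pages [2]. [...] Hub [...] [3]. [...] Hub [4]. [...] Hub [...] URL [...] Hub [...] URL [...] Hub [...]

2 [...]

[...] Hub [...] [4], [...] [5-6], [...] [7-9]. [...] Hub [...] URL [...] Meng et al. [4] use the keywords "index", "class" and "default" to identify Hub pages. Two limitations are listed: (1) [...] Hub [...]; (2) [...] Hub [...]. Ali et al. [5] [...] Hub [...]. The approach in [9] works on HTML [...] Hub [...] HTML [...] Hub [...] URL [...] URL [...] SVM [...] URL [...] URL [...]

3 [...] URL [...] Hub [...]

3.1 SVM [...]

Support Vector Machines (SVM) [...] Vapnik [...] VC [...] SVM [10-12]: [...] Mercer [...]. Four points follow:
(1) [...]
(2) SVM [...]
(3) SVM [...]
(4) [...]

3.2 [...] Hub [...]

[...] Hub [...] URL [...] Hub [...] URL [...] Hub [...]: URL [...] Hub [...]; [...] SVM [...]; [...] SVM [...] SVM [...]

3.3 [...]

Figure 1 shows [...] URL [...] Hub [...]:

Figure 1 [...] Hub [...] (figure not preserved)

(1) [...] URL [...]
[...] URL [...]

(2) [...]
[...] Hub [...] URL [...] URL [...] Hub [...]. The features are listed as name and description pairs; the recoverable fragments are:
- [...]: URL [...]
- [...]: [...] Hub [...] URL [...]
- [...]: [...] Hub [...] Hub [...] URL [...] URL [...]
- [...]: [...] URL [...] Hub [...]
- [...]: [...] Hub [...] URL [...]
- [...]: [...] "index", "class" [...]
- [...]: [...] Hub [...] ASP, JSP, ASPX, PHP [...]
- [...]: [...] URL [...] URL [...] ID [...] Hub [...] URL [...]
- [...]: [...] Hub [...] URL [...]
- [...]: [...] Hub [...] URL [...]
- [...]: [...] URL [...] Hub [...]
- [...]: [...] URL [...] Hub [...]

[...] 500 Hub [...] 100 ([...]) [...]. The statistics are:

Value                                   Count   Proportion   Percent
Group 1 (file-name keywords)
  "[...]"                               302     0.604        60.4
  "class", "index", "default", "list"   153     0.306        30.6
  "article", "content"                  0       0            0
  [...]                                 45      0.09         9
Group 2 (file extensions)
  "[...]"                               302     0.604        60.4
  "asp", "jsp", "aspx", "php"           123     0.246        24.6
  "shtml", "html", "htm"                75      0.15         15
  [...]                                 0       0            0
Group 3 (URL parameters)
  "[...]"                               412     0.824        82.4
  "id"                                  52      0.104        10.4
  [...]                                 36      0.072        7.2

(3) [...]
[...] URL [...] LibSVM [13] [...] URL [...] LibSVM [...] SVM [...] LibSVM-3.20 [...] Java [...] ~cjlin/libsvm/.

The LibSVM data format is:

    [label] [index1]:[value1] [index2]:[value2] …
    [label] [index1]:[value1] [index2]:[value2] …

Here label is the class; index is the feature number, starting from 1; value is the feature value, and a feature whose value is 0 may be omitted [...] index [...]. [...] svmscale [...] [-1, 1] [...] lower [...] upper [...]

[...] RBF [...] SVM [...] C-SVC [...] C [...] c [...] c [...]; c [...] RBF [...]: RBF [...] RBF [...] RBF [...]; [...]; [...] RBF [...] RBF [...] gamma [...] SVMtrain [...] SVM [...]:

    svm_type c_svc        % SVM type, here C-SVC
    kernel_type rbf       % kernel type, here RBF
    gamma 0.0769231       % gamma; the default is 1/k
    nr_class 2            % number of classes
    total_sv 132          % total number of support vectors
    rho 0.424462          % constant term of the decision function
    label 1 0             % class labels
    nr_sv 64 68           % number of support vectors per class
    SV                    % support vectors follow
    1 1:0.166667 2:1 3:-0.333333 4:-0.433962 5:-0.383562 6:-1 7:-1 8:0.0687023 9:-1 10:-0.903226 11:-1 12:-1 13:1
    0.5104832128985164 1:0.125 2:1 3:0.333333 4:-0.320755 5:-0.406393 6:1 7:1 8:0.0839695 9:1 10:-0.806452 12:-0.333333 13:0.5

[...] c [...] g [...] (c [...], g [...] gamma) [...] 10 [...] 9 [...] 10 [...] 10 [...] LibSVM [...]. [...] 1000 [...] 80%; [...] 2000 [...] 3000 [...] 91% [...] 2000 [...] (91%) [...] c = 32, g = 0.0625. [...] c [...] g [...] SVM [...] SVMtrain [...] LibSVM [...] X [...] SVM [...] Y [...]

4 [...]

4.1 [...]
[...] URL [...] Hub [...]: [...]; [...] URL [...] [6] [...] URL [...]

4.2 [...]
(1) [...] Script [...] CSS [...] Hub [...] 500 Hub [...] 0.6 [...] -0.2 [...]
(2) [...] HTML [...] DOM [...] HTML [...] HTML [...]:
1) [...] HTML [...] HTML [...]
2) [...] DOM [...] HTML [...] DOM [...]
3) [...] style [...] script [...] applet [...] HTML [...]
4) [...]: [...] URL [...] URL [...] 8 [...],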
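[...]: [...] URL [...] URL [...] URL [...] SVM [...]

5 [...]

5.1 [...]
[...]: [...] CPU [...]; [...] precision (Precision) [...] recall (Recall) [...] F1 [...] Precision [...] Recall [...] F1 [...] Precision [...] Recall [...] F1 [...], as defined in Table 1 and Equations (1) to (3):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (1)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (2)$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (3)$$

Table 1 [...] Hub [...]

              [...] Yes   [...] No
[...] Yes     TP          FP
[...] No      FN          TN

5.2 [...]
[...] 50 [...] 50 [...]: [...] (10), [...] (10), [...] (10), [...] (10), [...] (10). [...] 500 [...] 4 [...] 300 [...] 1000 [...] 2000 [...] 3000 [...] Hub [...] 30 [...] 1000 [...] 600 Hub [...] 400 [...]. [...] Win7 [...] CPU [...] Intel [...] 2GB [...]

5.3 [...] URL [...] Hub [...]
Table 2 [...] URL [...] Hub [...]: Precision is 91.20%, Recall is 86.33% and F1 is 88.70% [...] 91% [...]

Table 2 [...] URL [...] Hub [...] Hub [...]

              [...] Yes   [...] No
[...] Yes     518         50
[...] No      82          350

Table 3 [...] -0.2 [...] 0.6 [...] -0.1 [...]: Precision is 86.63%, Recall is 83.17% and F1 is 84.86% [...] [6] [...]

Table 3 [...] Hub [...]

              [...] Yes   [...] No
[...] Yes     499         77
[...] No      101         323

Table 4 [...]: Precision is 88.73%, Recall is 90.50% and F1 is 89.61% [...] [9] [...]

Table 4 [...] Hub [...]

              [...] Yes   [...] No
[...] Yes     543         69
[...] No      57          331

Table 5 [...] CPU [...]

5.4 [...] URL [...] Hub [...]

Table 5 [...]

No.   Method            [...]   [...] (s)   [...] (MB)   CPU [...]
1     [...]             1000    79.6        112          51%
2     [...]             1000    87.5        128          59%
3     [...] URL [...]   1000    21.3        36           17%

[...] 91% [...] 91.2% [...] Figure 2:

Figure 2 [...] URL [...] Hub [...] (figure not preserved)

[...]: [...]; [...]; [...] URL [...] Hub [...] URL [...] Hub [...] URL [...]: [...] Hub [...] URL [...] URL [...] URL [...] URL [...] URL [...] URL [...]; [...] Hub [...] Hub [...]

[...] Table 5 [...] URL [...] Hub [...] (70%) [...] URL [...] URL [...] HTML [...] HTML [...]; [...] CPU [...] 60% [...] URL [...] URL [...] Hub [...]

6 [...]
[...] URL [...] Hub [...] URL [...] 60% [...] URL [...] URL [...] URL [...]

References
[1] Meng Tao, Yan Hongfei, Wang Jimin. Characterizing Temporal Locality in Changes of Web Documents [J]. Journal of the China Society for Scientific and Technical Information, 2005, 24(4): 398-406.
[2] Li Xiaoming, Yan Hongfei, Wang Jimin. Search Engine: Theory, Technology and System [M]. Beijing: Science Press, 2005.
[3] Cho J, Garcia-Molina H. The Evolution of the Web and Implications for an Incremental Crawler [C]. In: Proceedings of the 26th International Conference on Very Large Data Bases, 2002.
[4] Meng T, Yan H, Wang J, et al. The Evolution of Link-attributes for Pages and Its Implications on Web Crawling [C]. In: Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, 2004.
[5] Ali R, Beg N M S. An Overview of Web Search Evaluation Methods [J]. Computers & Electrical Engineering, 2011, 37(6): 835-848.
[6] Cao Guifeng. Design and Implement of Webpage Classify and Clean in Search Engine [D]. Wuhan: Wuhan University of Technology, 2013.
[7] Zhang X, Zhou M, Geng G, et al. A Combined Feature Selection Method for Chinese Text Categorization [C]. In: Proceedings of the 2009 International Conference on Information Engineering and Computer Science, 2009.
[8] Xie Guanghua. Research and Application of Chinese Web Page Automatic Classification [J]. Journal of Dalian University of Technology, 2007.
[9] Wang R J, Wang D J. Web Information Acquisition by Personal Search Engine Based on SVM [J]. International Journal of Information Acquisition, 2005, 2(4): 345-352.
[10] Pang Jianfeng, Bu Dongbo, Bai Shuo. Research and Implementation of Text Categorization System Based on VSM [J]. Application Research of Computers, 2001, 18(9): 23-26.
[11] Li Liang, Liu Wanchun, Xu Quanqing, et al. A Professional Chinese Web Page Classifier Based on Support Vector Machine [J]. Computer Application, 2004, 24(4): 58-61.
[12] Zhang Xuegong. Introduction to Statistical Learning Theory and Support Vector Machines [J]. Acta Automatica Sinica, 2000, 26(1): 32-42.
[13] Chang C C, Lin C J. LIBSVM: A Library for Support Vector Machines [J]. Transactions on Intelligent Systems and Technology, 2011, 2(3): Article No. 27.
[14] Jiang J, Song X, Yu N, et al. Focus: Learning to Crawl Web Forums [J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(6): 1293-1306.
[15] Le A, Markopoul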
Document title: Research on a Hub Web Page Identification Method Using URL Features
Link: https://www.777doc.com/doc-5588905.html