Research on XML-Based Web Information Extraction Technology

Shenyang Ligong University, Master's Thesis
Author: Fan Chunxiao (范春晓)
Degree applied for: Master
Major: Computer Software and Theory
Advisor: He Xiaojun (和晓军)
Date: 2010-03-01

Abstract

With the rapid development of the Internet, the volume of data on the Web has grown sharply, making the Web the largest information source available. How to extract valuable information from the Web has therefore become a research focus. Much Web information is presented through information display pages, which serve as the main medium, so research on such pages is both significant and practical.

HTML is very successful at displaying data, but it focuses on the presentation of data rather than its description, so the tags alone do not reveal what content they enclose. XML is a newer technology that focuses on describing and manipulating data, which gives it clear advantages for data extraction. XHTML provides a bridge between the two: HTML can be converted into XHTML that conforms to the XML specification.

Because most Web pages are written in HTML, this thesis extracts data from information display pages using XML-related technologies. The approach has three steps. First, the target information display page is fetched and cleaned, and the cleaned HTML source is converted into a structured XHTML document with the Ntidy tool. Second, the main data block is extracted by assigning weights to DOM tree nodes, and data records are generated. Finally, the most useful information is selected using an XML-based domain vocabulary and the number of words in each data record, and the best data record is stored.

The thesis surveys technologies related to information extraction, proposes an extraction method based on the characteristics of information display pages, and builds an experimental model. During extraction, a suitable weight is chosen for the main data block to eliminate noise, and a second round of weighting is applied so that information is extracted precisely. Experiments show that the method achieves good recall and precision.

Keywords: Web information extraction, XML, information display page, weight coefficient

Thesis body (Chapter 1 through the start of Section 3.1). Only section numbers, citation markers, and Latin-script terms are legible:

1.1 Statistics from the 23rd CNNIC report (13 January 2009) [1]; HTML, SQL.
1.2 MUC evaluation tasks [2, 3]: (1) Named Entity Recognition (NE) [4]; (2) Multilingual Entity Task (MET) [5]; (3) Template Element (TE) [6]; (4) Coreference (CO); (5) Scenario Template (ST).
1.3 HTML and the Internet [7]; the Hidden Web [8]; XML.
1.4 Terms: XML, HTML, XHTML, DOM, HTML-Tidy; a four-item list.
1.5 Introduces the name XTWIE (XML based Twice Weight Information Extraction).
2.1 Web Information Extraction. 2.1.1: a four-item list [9-13] mentioning HTML and the DOM. 2.1.2: a six-item list [14-18] mentioning NLP, Ontology, and HTML. 2.1.3: no legible terms.
2.2.1 HTML (HyperText Markup Language), W3C [19].
2.2.2 XML (eXtensible Markup Language) [20]; well-formed XML [21] under W3C XML 1.0; six rules [22], with the examples <book>Professional XML</book> and <image file="SimpleXML.jpg"/>; further material on XML and HTML [23, 24].
2.2.3 DOM [25]; node types Document, Element, Text, and Attr; the ownerDocument property of Node; the relations ParentNode, ChildrenNode, and SiblingNode; an example tree with a books root, three book elements (Web Data Mining, Information Extraction, Professional XML), and an attribute with the value computer.
2.2.4 XHTML (eXtensible HyperText Markup Language); W3C released XHTML 1.0 on 26 January 2000 [26] as a reformulation of HTML 4.0 in XML; a three-item list on XHTML and HTML.
2.3 Mentions Google and Baidu; four approaches [27]: (1) based on HTML tags [28], using block-level tags such as P, UL, OL, DL, DIV, H1 to H6, ADDRESS, MENU, CENTER, CAPTION, TR, TD, and TABLE; (2) based on the DOM [29]; (3) based on table, ol, and ul structure [30]; (4) vision based, VIPS (Cai, Deng) [31].
2.4 XML applications [32]: Open Buying on the Internet (OBI) v2.0 and ASC X12; cXML (Ariba, Microsoft); Common Business Library (CBL, Commerce One); Extensible Forms Description Language (XFDL, uwi.com); Health Level 7 (HL7); Bean Markup Language (BML, IBM) for JavaBeans; Open Software Description (OSD, Marimba and Microsoft); UML, the Meta Object Facility (MOF) of the Object Management Group (OMG), and XML Metadata Interchange (XMI), with IBM, Unisys, Oracle, Rational, and Sybase named; Signaling System 7 (SS7); Call Policy Markup Language (CPML) and Voice-over-IP.
2.5 Terms: HTML, XHTML, XML, DOM.
3 Terms: HTML, HTML-Tidy, XHTML, DOM. 3.1 cites [33, 34, 35] and DSE (data-rich section extraction) [36]; the preview text ends here.
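
The steps named in the abstract (weight the DOM tree nodes to find the main data block, then pick the best record using a domain vocabulary and the word count) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the weighting rule (words outside links), the 0.6 share threshold, the VOCABULARY set, and the sample page are all assumptions made for illustration, since the actual weight coefficients are not given in the text above. The input is assumed to be well-formed XHTML already, as if a Tidy-style tool had produced it.

```python
from xml.dom import minidom

# Hypothetical domain vocabulary; the thesis uses an XML-based one whose
# contents are not given in the surviving text.
VOCABULARY = {"price", "author", "publisher"}


def text_of(node):
    """Concatenate all text beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def element_children(node):
    """Return only the element children, skipping text and comments."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def weight(node):
    """Assumed weight: number of words that are not inside <a> links."""
    if node.nodeType == node.TEXT_NODE:
        return len(node.data.split())
    if node.nodeType != node.ELEMENT_NODE or node.tagName == "a":
        return 0
    return sum(weight(child) for child in node.childNodes)


def main_data_block(root, share=0.6):
    """Descend while one child holds at least `share` of the node's weight."""
    node = root
    while True:
        total = weight(node)
        heavy = [c for c in element_children(node)
                 if total and weight(c) / total >= share]
        if not heavy:
            return node
        node = heavy[0]


def best_record(block):
    """Second pass: rank records by vocabulary hits, then by word count."""
    def score(record):
        words = text_of(record).lower().split()
        hits = sum(1 for w in words if w.strip(":,.") in VOCABULARY)
        return (hits, len(words))
    return max(element_children(block), key=score)


# Invented sample page (navigation and footer act as noise).
XHTML = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/books">Books</a></div>
<div id="content"><ul>
<li>Web Data Mining Price: 30</li>
<li>Information Extraction Price: 25</li>
<li>Professional XML Author: A. Writer Price: 10</li>
</ul></div>
<div id="footer">Contact us</div>
</body></html>"""

doc = minidom.parseString(XHTML)
block = main_data_block(doc.documentElement)
record = best_record(block)
print(block.tagName, "|", " ".join(text_of(record).split()))
```

On the sample page this prints `ul | Professional XML Author: A. Writer Price: 10`: the link-only navigation block gets weight 0, the footer is too light to dominate, and the descent stops at the list because no single record holds 60% of its weight.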
