Shenyang Ligong University
Master's Thesis

Research on XML-Based Web Information Extraction Technology

Author: 范春晓 (Fan Chunxiao)
Degree applied for: Master's
Major: Computer Software and Theory
Advisor: 和晓军 (He Xiaojun)
Date: 2010-03-01

Abstract

With the rapid development of the Internet, the volume of data on the Web has grown sharply, making the Web the largest information source available. How to extract valuable information from the Web has consequently become a research focus. Much Web information is currently presented on information display pages, which serve as the main medium, so research on such pages is both significant and practical.

HTML is very successful at displaying data, but it focuses on the presentation of data rather than its description, so the content of an element cannot be inferred from its tag. XML is a newer technology that focuses on operating on data, which gives it great advantages for data extraction. XHTML provides a bridge between the two: HTML can be converted into XHTML that conforms to the XML specification. Because a great many Web pages are written in HTML, this thesis extracts data from information display pages using XML-related technologies. The approach is as follows. First, the target information display page is fetched and cleaned, and the cleaned HTML source is converted into a structured XHTML document with the NTidy tool. Second, the main data block is extracted by assigning weights to the nodes of the DOM tree, and data records are generated. Finally, the most useful information is selected using an XML-based domain vocabulary and the number of words in each data record, and the best data record is stored.

This thesis studies the technologies related to information extraction. Based on the characteristics of information display pages, it proposes an information extraction method and builds an experimental model. During extraction, a reasonable weight value is chosen for the main data block so that noise information is removed, and a second weighting step is applied so that information is extracted accurately. The experiments show that the method achieves good results in recall and accuracy.

Keywords: Web Information Extraction, XML, Information Display Page, Weight Coefficient

Chapter 1

1.1 [...]
[...] 13 January 2009 [...] CNNIC [...] 23 [...] 2008 [...] 287.8 [...] 160 [...] 460217386099 KB [...] Web [1] [...] HTML [...] SQL [...]

1.2 [...]
[...] Web [2] [...] MUC [...] [3]
(1) [4] NE (Named Entity Recognition)
(2) [5] MET (Multilingual Entity Task)
(3) [6] TE (Template Element)
(4) CO (Coreference)
(5) ST (Scenario Template)

1.3 [...]
[...] [7] [...] HTML [...] Internet [...] [8] [...] Hidden Web [...] XML [...]

1.4 [...]
[...] XML [...] Web [...] HTML [...] XHTML [...] DOM [...] HTML-Tidy [...] DOM [...] XML [...]
(1) Web [...]
(2) [...]
(3) XML [...]
(4) [...]

1.5 [...]
[...] Web [...] HTML [...] XML [...] XHTML [...] DOM [...] XTWIE (XML based Twice Weight Information Extraction) [...]

Chapter 2

2.1 [...]
[...] Web information extraction (Web Information Extraction) [...] Internet [...] HTML [...]

2.1.1 [...]
[...] Web [9]
(1) [10] [...] HTML [...]
(2) DOM [11] [...] XML [...] DOM [...]
(3) Web [12] [...]
(4) [13] [...] HTML [...]

2.1.2 [...]
(1) [14] [...] NLP [...] Web [...]
(2) [15] [...]
(3) Ontology [16] [...]
(4) HTML [17] [...] Web [...] HTML [...]
(5) Web [18] [...]
(6) [...]

2.1.3 [...]
[...] Web [...]

2.2 [...]

2.2.1 HTML
[...] HTML (HyperText Markup Language) [...] W3C [...] Internet [...] Web [...] HTML [19] [...] body [...]

2.2.2 XML
[...] XML (eXtensible Markup Language) [20] [...] XML [21] [...] well-formed [...] W3C [...] XML 1.0 [...] XML [22]
(1) [...] XML [...]
(2) [...] XML [...] <book>Professional XML</book>
(3) [...]
(4) [...] "/" [...] <image file="SimpleXML.jpg"/>
(5) [...] (4) [...] file [...] SimpleXML.jpg [...]
(6) [...] XML [...]
[...] XML [23] [...] 2.1 [...] XML [...] [24] [...] HTML [...] p [...] li [...] HTML [...] XML [...] XHTML [...] XML [...]

2.2.3 DOM
[...] DOM [...] XML [...] DOM [25]
(1) [...] XML [...]
(2) [...] XML [...]
(3) [...]
(4) [...]
[...] W3C [...] XML [...] DOM [...]
(1) [...]
(2) [...] XML [...]
(3) [...] XML [...]
(4) [...] XML [...]
(5) [...] XML [...]
[...] W3C [...] DOM [...] 2.1 [...] XML [...] DOM [...] Document [...] Element [...] books [...] Element [...] book [...] Document [...] Node [...] ownerDocument [...] Document [...]
[...] books [...] order [...] classification = computer [...] Book: Web Data Mining [...] Book: Information Extraction [...] Book: Professional XML [...]
[...] 2.2 [...] DOM [...] XML [...] Element [...] <book>Professional XML</book> [...] Text [...] "Text Web Data Mining" [...] Attr [...] "Attr name="computer"" [...] DOM [...] XML [...] node-tree [...] ParentNode [...] ChildrenNode [...] SiblingNode [...]
(1) [...] DOM [...]
(2) [...]
(3) [...]
(4) [...]
(5) [...]

2.2.4 XHTML
[...] XHTML (eXtensible HyperText Markup Language) [...] XML [...] HTML [...] SGML [...] W3C [...] 26 January 2000 [...] XHTML 1.0 [26] [...] HTML [...] XML [...] HTML 4.0 [...]
(1) [...] HTML [...] PAD [...]
(2) [...] XHTML [...] XML [...] HTML [...]
(3) [...] XHTML [...] XML [...] HTML [...]

2.3 [...]
[...] Web [...] Internet [...] Google [...] Baidu [...] Web [...] 2.3 [...] Web [27]
(1) [...] HTML [28] [...] HTML [...] DOM [...] 2.4 [...]
P, UL, OL, DL, [...], DIV, [...], H1, H2, H3, H4, H5, H6, ADDRESS, MENU, CENTER, CAPTION, TR, TD, BLOCKQUOTE, TABLE
2.4 [...] HTML [...]
(2) [...] DOM [29] [...] HTML [...] HTML-DOM [...] tag [...] DOM [...]
(3) [30] [...] HTML [...] table [...] ol [...] ul [...] table [...] K [...]
(4) [31] [...] DOM [...] Web [...] VIPS [...] Cai [...] Deng [...] DOM [...] VIPS [...] HTML-DOM [...] DOC [...] VIPS [...] 2.5 [...] DOM [...] VIPS [...]
2.5 [...] VIPS [...] Web [...] DOC [...] HTML [...] W3C [...] table [...] XML [...]

2.4 [...]
[...] XML [32]
(1) [...] Internet [...] Open Buying on the Internet (OBI) Consortium [...] OBI v2.0 [...] ASC X12 [...] XML [...] Ariba [...] Microsoft [...] Commerce XML (cXML) [...] CommerceOne [...] Common Business Library (CBL) [...] XML [...]
(2) [...] uwi.com [...] Extensible Forms Description Language (XFDL) [...] XML [...]
(3) [...] XML [...] 1987 [...] ANSI X12 [...] Health Level 7 (HL7) [...] XML [...]
(4) [...] XML [...] IBM [...] Bean Markup Language (BML) [...] XML [...] JavaBean [...] Open Software Description (OSD) [...] Marimba [...] Microsoft [...] Unified Modeling Language (UML) [...] "ML" [...] XML [...] Meta Object Facility (MOF) [...] Object Management Group (OMG) [...] IBM, Unisys, Oracle, Rational, Sybase [...] XML Metadata Interchange (XMI) [...] XML [...] UML [...] MOF [...] Signaling System 7 (SS7) [...] XML [...] Call Policy Markup Language (CPML) [...] IP [...] Voice-over-IP [...] XML [...]

2.5 [...]
[...] Web [...] HTML [...] XHTML [...] HTML [...] XML [...] DOM [...] Web [...] XML [...]

Chapter 3

[...] Web [...] HTML [...] HTML-Tidy [...] XHTML [...] DOM [...]

3.1 [...]
[...] [33,34,35] [...] Web [...] DOM [...] XML [...] [36] [...] DSE (data-rich section extraction) [...] V
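To illustrate the second step outlined in the abstract (weighting the nodes of the DOM tree to locate the main data block), a minimal Python sketch follows. It is only an illustration under stated assumptions: the weight used here (subtree text length minus link-text length) is a common heuristic chosen for the example and is not the weight formula of the thesis, and the sample XHTML, the BLOCK_TAGS set, and all function names are hypothetical. The input is assumed to be well-formed XHTML, such as a Tidy-style cleaner might produce.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for cleaned, well-formed XHTML.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
    <div id="main">
      <h1>XML-based extraction</h1>
      <p>The cleaned page is parsed into a DOM tree and each node is weighted.</p>
      <p>Nodes with much text and few links are likely to hold the main data.</p>
    </div>
    <div id="footer"><a href="/about">About us</a></div>
  </body>
</html>"""

# Assumed subset of block-level tags considered as candidate data blocks.
BLOCK_TAGS = {"div", "table", "td", "ul", "ol", "dl"}


def local_name(elem):
    """Return the tag name without its XML namespace prefix."""
    return elem.tag.rsplit("}", 1)[-1]


def text_length(elem):
    """Count non-whitespace characters in the element's whole subtree."""
    return sum(len("".join(t.split())) for t in elem.itertext())


def link_text_length(elem):
    """Count non-whitespace characters that sit inside <a> elements."""
    return sum(text_length(e) for e in elem.iter() if local_name(e) == "a")


def weight(elem):
    """Assumed weight: text length minus link-text length (illustrative only)."""
    return text_length(elem) - link_text_length(elem)


def main_block(root):
    """Return the candidate block element with the highest assumed weight."""
    blocks = [e for e in root.iter() if local_name(e) in BLOCK_TAGS]
    return max(blocks, key=weight)


root = ET.fromstring(XHTML)
best = main_block(root)
print(best.get("id"), weight(best))
```

In this sketch the navigation and footer blocks consist only of link text, so their weight is zero and the content block is selected. A real implementation would need a threshold and a second scoring pass over the resulting data records, which the source describes but does not specify in the surviving text.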