Chapter 10: Information Extraction
Yang Jianwu (杨建武)
Email: yangjw@pku.edu.cn
Institute of Computer Science and Technology, Peking University
Text Mining Technology (Spring 2012)

What is Information Extraction?
Information Retrieval
• You have an information need, but what you get back isn't information but documents, which you hope have the information.
Information Extraction
• It is one approach to going further for a special case:
  • There's some relation you're interested in
  • Your query is for elements of that relation
  • A limited form of natural language understanding
The goal of information extraction (IE) is to transform text into a structured format (e.g. database records) according to its content.

Information Extraction of Seminar Announcements

Information Extraction As An Annotation Task

Extracting Corporate Information
• Data automatically extracted from marketsoft.com
• Source web page. Color highlights indicate type of information (e.g., red = name).

Product information

Landscape of IE Tasks
E.g. word patterns:
• Closed set (U.S. states)
  • He was born in Alabama…
  • The big Wyoming sky…
• Regular set (U.S. phone numbers)
  • Phone: (413) 545-1323
  • The CALD main office can be reached at 412-268-1299
• Complex pattern (U.S. postal addresses)
  • University of Arkansas, P.O. Box 140, Hope, AR 71802
  • Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
• Ambiguous patterns, needing context and many sources of evidence (person names)
  • …was among the six houses sold by Hope Feldman that year.
  • Pawel Opalinski, Software Engineer at WhizBang Labs.

Landscape of IE Tasks
Example text: Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
• Single entity ("named entity" extraction)
  • Person: Jack Welch
  • Person: Jeffrey Immelt
  • Location: Connecticut
• Binary relationship
  • Relation: Person-Title; Person: Jack Welch; Title: CEO
  • Relation: Company-Location; Company: General Electric; Location: Connecticut
• N-ary record
  • Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Difficulties
Textual inconsistency
Example: digital cameras
• Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
• Image Capture Device Total Pixels Approx. 3.34 million Effective Pixels Approx. 3.24 million
• Image sensor Total Pixels: Approx. 2.11 million-pixel
• Imaging sensor Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)
• CCD Total Pixels: Approx. 3,340,000 (2,140 [H] x 1,560 [V])
• Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V])
• Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V])
These all came off the same manufacturer's website!! And this is a very technical domain.

Evaluation
Template
Measure for each test document:
• Total number of correct extractions in the solution template: N
• Total number of slot/value pairs extracted by the system: E
• Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
Compute the average value of metrics adapted from IR:
• Recall = C/N
• Precision = C/E
• F-Measure = harmonic mean of recall and precision

Three generations of IE systems
• Hand-Built Systems: Knowledge Engineering [1980s]
  • Rules written by hand
  • Require experts who understand both the systems and the domain
  • Iterative guess-test-tweak-repeat cycle
• Automatic, Trainable Rule-Extraction Systems [1990s]
  • Rules discovered automatically using predefined templates
  • Require huge, labeled corpora (effort is just moved!)
• Machine Learning (Sequence) Models [1997]
  • One decodes a statistical model that classifies the words of the text, using HMMs, random fields or statistical parsers
  • Learning usually supervised; may be partially unsupervised

Representation Models
• Boundary Models: for "Abraham Lincoln was born in …", a classifier decides which class (BEGIN or END) applies at each position.
• Finite State Machines: for "Abraham Lincoln was born in Kentucky.", find the most likely state sequence.
• Context Free Grammars: for "Abraham Lincoln was born in Kentucky.", build a parse tree (node labels in the figure: NNP, V, P, NP, PP, VP, S).

Wrappers
• If we think of things from the database point of view:
  • We want to be able to run database-style queries
  • But we have data in some horrid textual form or content management system that doesn't allow such querying
  • We need to "wrap" the data in a component that understands database-style querying
• Many people have "wrapped" many web sites
  • Commonly something like a Perl script
  • Often easy to do as a one-off
• But hand-coding wrappers in Perl isn't very viable
  • Sites are numerous, and their surface structure mutates rapidly (around 10% failures each month)

Amazon Book Description
  ….
  <b class=sans>The Age of Spiritual Machines: When Computers Exceed Human Intelligence</b><br>
  <font face=verdana,arial,helvetica size=-1>
  by <a href=/exec/obidos/search-handle-url/index=books&field-author=Kurzweil%2C%20Ray/002-6235079-4593641>Ray Kurzweil</a><br>
  </font><br>
  <a href===90 height=140 align=left border=0></a>
  <font face=verdana,arial,helvetica size=-1>
  <span class=small>
  <b>List Price:</b> <span class=listprice>$14.95</span><br>
  <b>Our Price: <font color=#990000>$11.96</font></b><br>
  <b>You Save:</b> <font color=#990000><b>$2.99</b> (20%)</font><br>
  </span><p><br>
  …

Extracted Book Template
  Title: The Age of Spiritual Machines: When Computers Exceed Human Intelligence
  Author: Ray Kurzweil
  List-Price: $14.95
  Price: $11.96
  :
  :

Template Types
• Slots in a template are typically filled by a substring from the document.
• Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
  • Terrorist act: threatened, attempted, accomplished.
  • Job type: clerical, service, custodial, etc.
  • Company type: SEC code
• Some slots may allow multiple fillers.
  • Programming language
• Some domains may allow multiple
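
To make the wrapper and evaluation slides concrete, here is a minimal sketch in Python (the slides mention Perl; Python is swapped in here for illustration). It fills the four slots of the extracted book template from a trimmed version of the Amazon snippet, then scores the result with the Recall, Precision and F-Measure definitions from the evaluation slide. The names PAGE, RULES, SOLUTION, extract and score are hypothetical, the markup in PAGE is an assumption reconstructed from the slide text, and the author link target is replaced by a placeholder.

```python
import re

# Trimmed page fragment, assumed from the "Amazon Book Description" slide.
# The href value "#" is a placeholder, not the original link.
PAGE = (
    "<b class=sans>The Age of Spiritual Machines: When Computers Exceed "
    "Human Intelligence</b><br>"
    "<font face=verdana,arial,helvetica size=-1>by <a href=#>"
    "Ray Kurzweil</a><br></font>"
    "<span class=small><b>List Price:</b>"
    "<span class=listprice>$14.95</span><br>"
    "<b>Our Price:<font color=#990000>$11.96</font></b><br></span>"
)

# One hand-written pattern per template slot (hypothetical rules).
# They depend on the page's surface markup, which is why such wrappers
# break when a site changes its layout.
RULES = {
    "Title": re.compile(r"<b class=sans>(.*?)</b>"),
    "Author": re.compile(r"by <a [^>]*>(.*?)</a>"),
    "List-Price": re.compile(r"<span class=listprice>(\$[\d.]+)</span>"),
    "Price": re.compile(r"Our Price:<font [^>]*>(\$[\d.]+)</font>"),
}

# Solution template, copied from the "Extracted Book Template" slide.
SOLUTION = {
    "Title": "The Age of Spiritual Machines: When Computers Exceed "
             "Human Intelligence",
    "Author": "Ray Kurzweil",
    "List-Price": "$14.95",
    "Price": "$11.96",
}


def extract(page):
    """Fill each slot with the first substring its pattern captures."""
    template = {}
    for slot, pattern in RULES.items():
        match = pattern.search(page)
        if match:
            template[slot] = match.group(1).strip()
    return template


def score(extracted, solution):
    """Return (precision, recall, f_measure) for one document.

    N = slot/value pairs in the solution template,
    E = slot/value pairs extracted by the system,
    C = extracted pairs that also appear in the solution template.
    """
    n = len(solution)
    e = len(extracted)
    c = sum(1 for slot, value in extracted.items()
            if solution.get(slot) == value)
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    filled = extract(PAGE)
    for slot, value in filled.items():
        print(f"{slot}: {value}")
    print(score(filled, SOLUTION))
```

Keeping one pattern per slot makes the rule set easy to inspect, but every pattern is tied to a specific tag or attribute. That is the fragility the Wrappers slide points at: a small layout change silently empties a slot, which lowers recall in the score.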