您好,欢迎访问三七文档
当前位置:首页 > IT计算机/网络 > 数据挖掘与识别 > stanford大学-大数据挖掘-RelationExtraction-18
CS345DataMiningMiningtheWebforStructuredDataOurviewofthewebsofar…WebpagesasatomicunitsGreatforsomeapplicationse.g.,ConventionalwebsearchButnotalwaystherightmodelGoingbeyondwebpagesQuestionansweringWhatistheheightofMtEverest?WhokilledAbrahamLincoln?RelationExtractionFindallcompany,CEOpairsVirtualDatabasesAnswerdatabase-likequeriesoverwebdataE.g.,FindallsoftwareengineeringjobsinFortune500companiesQuestionAnsweringE.g.,WhokilledAbrahamLincoln?NaïvealgorithmFindallwebpagescontainingtheterms“killed”and“AbrahamLincoln”incloseproximityExtractk-gramsfromasmallwindowaroundthetermsFindthemostcommonlyoccuringk-gramsQuestionAnsweringNaïvealgorithmworksfairlywell!SomeimprovementsUsesentencestructuree.g.,restricttonounphrasesonlyRewritequestionsbeforematching“WhatistheheightofMtEverest”becomes“TheheightofMtEverestisblank”ThenumberofpagesanalyzedismoreimportantthanthesophisticationoftheNLPForsimplequestionsReference:DumaisetalRelationExtractionFindpairs(title,author)WheretitleisthenameofabookE.g.,(Foundation,IsaacAsimov)Findpairs(company,hq)E.g.,(Microsoft,Redmond)Findpairs(abbreviation,expansion)(ADA,AmericanDentalAssociation)Canalsohavetupleswith2componentsRelationExtractionAssumptions:NosinglesourcecontainsallthetuplesEachtupleappearsonmanywebpagesComponentsoftupleappear“close”togetherFoundation,byIsaacAsimovIsaacAsimov’smasterpiece,theemFoundation/emtrilogyTherearerepeatedpatternsinthewaytuplesarerepresentedonwebpagesNaïveapproachStudyafewwebsitesandcomeupwithasetofpatternse.g.,regularexpressionsletter=[A-Za-z.]title=letter{5,40}author=letter{10,30}b(title)/bby(author)ProblemswithnaïveapproachApatternthatworksononewebpagemightproducenonsensewhenappliedtoanotherSopatternsneedtobepage-specific,oratleastsite-specificImpossibleforahumantoexhaustivelyenumeratepatternsforeveryrelevantwebsiteWillresultinlowcoverageBetterapproach(Brin)ExploitdualitybetweenpatternsandtuplesFindtuplesthatmatchasetofpatternsFindpatternsthatmatchalotoftuplesDIPRE(DualIterativePatternRelationExtraction)PatternsTuplesMatchGenerateDIPREAlgorithm1.RÃSampleTuplese.g.,asmallsetoftitle,authorpairs2.OÃFindOccurrences(R)OccurrencesoftuplesonwebpagesKeepsomesurroundingcontext3.PÃGenPatterns(O)LookforpatternsinthewaytuplesoccurMakesurepatternsarenottoogeneral!4.RÃMatchingTuples(P)5.ReturnorgobacktoStep2Occurrencese.g.,TitlesandauthorsRestricttocaseswhereauthorandtitleappearincloseproximityonwebpagelibFoundation/bbyIsaacAsimov(1951)url=order=[title,author](or[author,title])denoteas0or1prefix=“lib”(limittoe.g.,10characters)middle=“/bby”suffix=“(1951)”occurrence=(’Foundation’,’IsaacAsimov’,url,order,prefix,middle,suffix)PatternslibFoundation/bbyIsaacAsimov(1951)pbNightfall/bbyIsaacAsimov(1941)order=[title,author](say0)sharedprefix=bsharedmiddle=/bbysharedsuffix=(19pattern=(order,sharedprefix,sharedmiddle,sharedsuffix)URLPrefixPatternsmaybespecifictoawebsiteOrevenpartsofitAddurlprefixcomponenttopattern:libFoundation/bbyIsaacAsimov(1951):pbNightfall/bbyIsaacAsimov(1941)sharedurlprefix==(urlprefix,order,prefix,middle,suffix)GeneratingPatterns1.Groupoccurencesbyorderandmiddle2.LetO=setofoccurenceswiththesameorderandmiddlepattern.order=O.orderpattern.middle=O.middlepattern.urlprefix=longestcommonprefixofallurlsinOpattern.prefix=longestcommonprefixofoccurrencesinOpattern.suffix=longestcommonsuffixofoccurrencesinOExample:libFoundation/bbyIsaacAsimov(1951):pbNightfall/bbyIsaacAsimov(1941)order=[title,author]middle=“/bby”urlprefix=prefix=“b”suffix=“(19”Example:Foundation,byIsaacAsimov,hasbeenhailed…:Nightfall,byIsaacAsimov,tellsthetaleof…order=[title,author]middle=“,by”urlprefix=prefix=“”suffix=“,”PatternSpecificityWewanttoavoidgeneratingpatternsthataretoogeneralOneapproach:Forpatternp,definespecificity=|urlprefix||middle||prefix||suffix|Supposen(p)=numberofoccurencesthatmatchthepatternpDiscardpatternswheren(p)nminDiscardpatternspwherespecificity(p)n(p)thresholdPatternGenerationAlgorithm1.Groupoccurencesbyorderandmiddle2.LetO=asetofoccurenceswiththesameorderandmiddle3.p=GeneratePattern(O)4.Ifpmeetsspecificityrequirements,addptosetofpatterns5.Otherwise,trytosplitOintomultiplesubgroupsbyextendingtheurlprefixbyonecharacterIfalloccurencesinOarefromthesameURL,wecannotextendtheurlprefix,sowediscardOExtendingtheURLprefixSupposeOcontainsoccurencesfromurlsoftheform://=
本文标题:stanford大学-大数据挖掘-RelationExtraction-18
链接地址:https://www.777doc.com/doc-3278121 .html