Xiao Yanghua (肖仰华): Key Technologies for Knowledge-Graph-Based User Profiling
• Knowledge bases: Yago, WordNet, FreeBase, Probase, NELL, CYC, DBPedia, ...; CN-DBpedia (kw.fudan.edu.cn)

• PageRank

[Figure: tag propagation example on a graph with nodes a, b, c, x, y, showing weighted tags on each node (e.g. Christ, pop, food, freedom, love, fashion, photography, music) before and after one round of propagation; labels f_xa = 0.5, f_xb = 0.2, f_xc = 0.3, f_yb = 0.2, f_yc = 0.3, f_xx = 1]

Deqing Yang, Yanghua Xiao, An Integrated Tag Recommendation Algorithm Towards Weibo User Profiling (DASFAA 2015)

Method                     MAP      NDCG@5   NDCGexp@5
Only NER                   0.5899   0.6756   0.7409
Only 1-step association    0.7470   0.7342   0.7473
Forward RW                 0.7783   0.7623   0.8107

• Examples of summarizing tags into concepts:
  – china, japan, india, korea → asian country
  – dinner, lunch, food, child, girl → meal, child
  – bride, groom, dress, celebration → wedding

• Criteria:
  – Coverage
  – Minimality
[Figure: isA relations in Probase; labels: web application framework, rails, struts, framework, mvc framework, web framework, Orphan, django]

[...] using the model that gives the maximal posterior likelihood, i.e. $\arg\max_{c \in C} P(x|c)$. This leads to the shortest code length for $x$. However, to decode the data, we also need to know which model is used to encode $x$. We describe two possible schemes for this purpose: two-part code and universal code.

Two-part code. For a word encoded by concept $c_i$, we also encode the index $i$. Since we have $|C|$ concepts overall, each index can be encoded with $\log|C|$ bits. Applying the principles of MDL, we have:

$$CL(X, C) = L(C) + L(X|C) = \sum_{c_i \in C} L(c_i) + \sum_{x_i \in X} L^*(x_i|C) \tag{8}$$

where $L^*(x|C)$ is the code length for encoding an individual word $x$ given the prior knowledge of $C$:

$$L^*(x|C) = \log|C| + \min_{c \in C} L(x|c) \tag{9}$$

The input may contain outliers that should not be summarized to concepts. For example, {apple, banana, breakfast, dinner, pork, beef, bullet} are direct objects of the verb eat. We may summarize them into the concepts {fruit, meal, meat}, except for the last word, bullet. We need to make a choice: either encode the outlier independently or encode it with some concept $c$. We use the MDL principle to make the choice. That is, we calculate the code lengths yielded by the two options and select the one with the shorter length. This leads to a new definition of $L^*(x|C)$:

$$L^*(x|C) = \min \begin{cases} L(x), & \text{encode directly} \\ \log|C| + L(x|c), & \text{encode using } c \in C \end{cases} \tag{10}$$

Because each word is encoded independently, the combination of local optimums guarantees the global optimum. Using this scheme, each word $x$ will be assigned to the concept $c$ which has the maximal posterior probability $P(x|c)$.

Universal code. Alternatively, we may generate a universal model, which mixes all the models into one model. For example, we can create a universal model based on occurrence probability, i.e. $P(x|C) = \sum_{c \in C} P(x|c) P(c)$. The regret measure [Shtar'kov, 1987] is used to evaluate different universal models. For a given data item $x$, the regret for a universal model $P(x|C)$ relative to the original model class $C$ is defined as:

$$R(x, P) = -\log P(x|C) - \min_{c \in C} \{ -\log P(x|c) \} \tag{11}$$

Intuitively, $R(x, P)$ is the additional number of bits needed to encode $x$ using distribution $P$ compared to using the optimal maximum likelihood model. The best universal model should minimize the maximal additional bits over the entire data space, that is

$$\min_P \max_{x \in \mathcal{X}} R(x, P) \tag{12}$$

where $\mathcal{X}$ is the data space that individual data $x$ resides in. It was shown [Shtar'kov, 1987] that the normalized maximum likelihood model achieves the minimum. The normalized maximum likelihood is defined as:

$$P_{NML}(x|C) = \frac{\hat{P}(x|C)}{\sum_{x' \in \mathcal{X}} \hat{P}(x'|C)} \tag{13}$$

where $\hat{P}(x|C)$ is the maximal posterior likelihood of $x$ using a model from $C$, i.e. $\hat{P}(x|C) = \max_{c \in C} P(x|c)$. Using the normalized maximum likelihood code, we can reformulate $L^*(x|C)$ in Eq 10 as:

$$L^*(x|C) = \min\bigl(L(x), -\log P_{NML}(x|C)\bigr) = \min\left(L(x), -\log \frac{\hat{P}(x|C)}{\sum_{x'} \hat{P}(x'|C)}\right), \qquad \hat{P}(x|C) = \max_{c \in C} P(x|c) \tag{14}$$

Similar to two-part code, each word $x$ will be assigned to the concept $c$ that has the maximal posterior probability $P(x|c)$.

3.4 Integrating Attributes

In our MDL model, we use $P(x|c)$ to characterize the relationship between a concept $c$ and an input word $x$. Up to now, we have assumed that the relationship is the isA relationship, and $P(x|c)$ is defined as in Eq 1. However, the input may contain words that are attributes or properties of a concept, as in {population, president, location}, which triggers the concept country.

To incorporate attributes, we combine the isA and the is-PropertyOf relations into a unified probabilistic model. In practice, it is rare that an input word is both an instance and an attribute of a concept. As in [Song et al., 2011], we combine the typicality using a noisy-or model:

$$P(c|x) = 1 - \bigl(1 - P_e(c|x)\bigr)\bigl(1 - P_a(c|x)\bigr) \tag{15}$$

where $P_e(c|x)$ denotes the isA typicality as defined in Eq 1, and $P_a(c|x)$ denotes the attribute typicality as defined in Eq 3. Intuitively, $P(c|x)$ is the likelihood that the word $x$ invokes concept $c$, by being either its instance or attribute. The reversed typicality $P(x|c)$ is inferred using the Bayes rule: $P(x|c) = P(c|x) P(x) / P(c)$.

3.5 Tradeoff between Coverage and Minimality

In practice, it may be more desirable to limit the number of concepts, or to generate more concepts for better coverage of meaning. We thus extend our model to add an adjustable parameter for balancing the importance of concepts and tags. We reformulate the final MDL measure to:

$$C^* = \arg\min_C \; \alpha L(C) + (1 - \alpha) L(X|C) = \arg\min_C \; \alpha \sum_{c_i \in C} L(c_i) + (1 - \alpha) \sum_{x_i \in X} L^*(x_i|C) \tag{16}$$

where $X$ is the input, $C$ is the set of concepts used to encode the input, $L^*(x|C)$ is the code length of an individual word, depending on whether two-part code or NML code is used, and $\alpha$ is a parameter that can be used to trade off coverage and mini
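
The code-length definitions in Eq 10, Eq 15 and Eq 16 can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the excerpt: code lengths are taken as Shannon lengths, so L(x) = -log2 P(x), L(x|c) = -log2 P(x|c) and L(c) = -log2 P(c). All probabilities below are toy values chosen for easy arithmetic, and the function names (`two_part_len`, `objective`, `noisy_or`) are hypothetical, not taken from the paper.

```python
import math


def bits(p):
    """Shannon code length in bits for probability p (assumed form of L)."""
    return -math.log2(p)


def noisy_or(p_isa, p_attr):
    """Eq 15: chance that a word invokes a concept as instance or attribute."""
    return 1.0 - (1.0 - p_isa) * (1.0 - p_attr)


def two_part_len(word, concepts, p_x_given_c, p_x):
    """Eq 10: shorter of direct encoding and the best concept-based encoding.

    Returns (code length, chosen concept); the concept is None when the
    word is cheaper to encode directly, i.e. it is treated as an outlier.
    """
    best_len, best_c = bits(p_x[word]), None
    if not concepts:
        return best_len, best_c
    index_bits = math.log2(len(concepts))  # cost of naming the concept
    for c in concepts:
        p = p_x_given_c.get((word, c), 0.0)
        if p > 0.0 and index_bits + bits(p) < best_len:
            best_len, best_c = index_bits + bits(p), c
    return best_len, best_c


def objective(words, concepts, p_x_given_c, p_x, p_c, alpha):
    """Eq 16: alpha * L(C) + (1 - alpha) * L(X|C) with two-part code."""
    l_concepts = sum(bits(p_c[c]) for c in concepts)
    l_words = sum(two_part_len(w, concepts, p_x_given_c, p_x)[0] for w in words)
    return alpha * l_concepts + (1.0 - alpha) * l_words


# Toy inputs (illustrative values only, not from the source).
WORDS = ["apple", "banana", "pork", "beef", "bullet"]
P_X = {w: 1 / 64 for w in WORDS}            # L(x) = 6 bits each
P_C = {"fruit": 1 / 8, "meat": 1 / 8}       # L(c) = 3 bits each
P_X_GIVEN_C = {
    ("apple", "fruit"): 1 / 4,
    ("banana", "fruit"): 1 / 4,
    ("pork", "meat"): 1 / 4,
    ("beef", "meat"): 1 / 4,
    ("bullet", "meat"): 1 / 1024,           # very atypical member
}

if __name__ == "__main__":
    for candidate in ([], ["fruit"], ["fruit", "meat"]):
        score = objective(WORDS, candidate, P_X_GIVEN_C, P_X, P_C, alpha=0.5)
        print(candidate, score)
```

With these toy numbers, the concept set {fruit, meat} gives the lowest weighted code length of the three candidates, and bullet is encoded directly because naming a concept for it would cost more bits than its own code. A full search over candidate concept sets is not shown here; the sketch only scores sets that are given to it.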