Generating Gene Summaries from Biomedical Literature: A Study of Semi-Structured Summarization

Xu Ling, Jing Jiang, Xin He, Qiaozhu Mei, Chengxiang Zhai, Bruce Schatz

Department of Computer Science, Institute for Genomic Biology, University of Illinois at Urbana-Champaign, IL 61801

E-mail: {xuling, jiang4, xinhe2, qmei2, czhai, schatz}@uiuc.edu

Abstract

Most knowledge accumulated through scientific discoveries in genomics and related biomedical disciplines is buried in the vast amount of biomedical literature. Since understanding gene regulations is fundamental to biomedical research, summarizing all the existing knowledge about a gene based on literature is highly desirable to help biologists digest the literature. In this paper, we present a study of methods for automatically generating gene summaries from biomedical literature. Unlike most existing work on automatic text summarization, in which the generated summary is often a list of extracted sentences, we propose to generate a semi-structured summary which consists of sentences covering specific semantic aspects of a gene. Such a semi-structured summary is more appropriate for describing genes and poses special challenges for automatic text summarization. We propose a two-stage approach to generate such a summary for a given gene: first retrieving articles about a gene and then extracting sentences for each specified semantic aspect. We address the issue of gene name variation in the first stage and propose several different methods for sentence extraction in the second stage. We evaluate the proposed methods using a test set with 20 genes. Experiment results show that the proposed methods can generate useful semi-structured gene summaries automatically from biomedical literature, and our proposed methods outperform general purpose summarization methods. Among all the proposed methods for sentence extraction, a probabilistic language modeling approach that models gene context performs the best.

Keywords: Summarization, Genomics, Probabilistic language model

Preprint submitted to Elsevier Science, 13 December 2006

1 Introduction

Biomedical literature has been playing a central role in the research activities of all biologists. The growing amount of scientific discoveries in genomics and related biomedical disciplines has led to a corresponding growth in the amount of literature information. Because of its daunting size and complexity, there have been increasing efforts devoted to integrating this huge resource for biologists to digest quickly.

Understanding gene functions is fundamental to biomedical research, and one fundamental task that biomedical researchers often have to perform is to find and summarize all the knowledge about a particular gene from the literature, a problem that we call gene summarization.

Because of the importance of genes, there has been much manual effort on constructing an informative summary of a gene based on literature information. For example, FlyBase¹ (R.A. Drysdale and Consortium, 2005), one of the model organism genome databases, provides a text summary for each Drosophila gene, including DNA sequence, functional description, mutant information, etc. Compressing and arranging all the knowledge from a huge amount of literature into different aspects enables biologists to quickly understand the target gene. However, such gene summaries are currently generated by manually extracting information from literature, which is extremely labor-intensive and cannot keep up with the rapid growth of the literature information. With the growing amount of scientific discoveries in genomics and related biomedical disciplines, automatic summarization of gene descriptions in multiple aspects from biomedical literature has become an urgent task.

One characteristic of an informative gene summary is that the summary should ideally consist of sentences that cover several important semantic aspects such as sequence information, mutant phenotype, and gene product. That is, the summary is semi-structured. For example, Figure 1 shows a sample gene summary in FlyBase retrieved in 2005. Here we see that the summary consists of sentences covering the following aspects of a gene: (1) Gene products (GP); (2) Expression location (EL); (3) Sequence information (SI); (4) Wild-type function and phenotypic information (WFPI); (5) Mutant phenotype (MP); and (6) Genetical interaction (GI), as annotated. We thus propose to frame the gene summarization problem as automatically generating a semi-structured summary consisting of sentences covering these six aspects of a gene. Such a summary is not only very useful in itself, but can also serve as a set of entry points to the literature by linking each aspect to the supporting evidence in the literature.

¹ http://flybase.bio.indiana.edu/

Fig. 1. Example Gene Summary In FlyBase.

Most existing work on automatic text summarization has focused on news summarization, and the generated summary is generally unstructured, consisting of a list of sentences. The existing summarization methods are thus inadequate for generating a semi-structured summary. In this paper, we present a study of methods for automatically generating semi-structured gene summaries from biomedical literature. Although our studies mainly focus on the biomedical literature domain, the approaches we propose are generally applicable to semi-structured summarization in other applications, such as product reviews. Under the assumption that we have some training sentences for each aspect, generalizing our methods to other applications is straightforward.

We propose a two-stage approach to generate such a summary for a given gene, in which we would first retrieve articles about a gene and then extract sentences for each of six specified semantic aspects. Whi
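The two-stage idea described above (retrieve articles about a gene, then pick a sentence for each semantic aspect using some training sentences per aspect) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the method evaluated in the paper: the function names (`retrieve`, `aspect_profile`, `score`, `summarize`), the whole-word matching of gene name variants, and the simple word-overlap score are all hypothetical stand-ins. In particular, the overlap score replaces, rather than reproduces, the probabilistic language modeling approach mentioned in the abstract.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def retrieve(articles, gene_variants):
    """Stage 1 (assumed form): keep articles mentioning any known name variant
    of the gene as a whole word, ignoring case."""
    patterns = [
        re.compile(r"\b" + re.escape(v) + r"\b", re.IGNORECASE)
        for v in gene_variants
    ]
    return [a for a in articles if any(p.search(a) for p in patterns)]


def aspect_profile(training_sentences):
    """Word-count profile built from the training sentences of one aspect."""
    return Counter(tok for s in training_sentences for tok in tokenize(s))


def score(sentence, profile):
    """Hypothetical relevance score: average relative frequency, in the
    aspect profile, of the words in the sentence."""
    tokens = tokenize(sentence)
    total = sum(profile.values())
    if not tokens or total == 0:
        return 0.0
    return sum(profile[t] for t in tokens) / (total * len(tokens))


def summarize(articles, gene_variants, training):
    """Stage 2 (assumed form): for each aspect, return the best-scoring
    sentence from the retrieved articles, or None if nothing overlaps."""
    docs = retrieve(articles, gene_variants)
    sentences = [s for d in docs for s in split_sentences(d)]
    summary = {}
    for aspect, examples in training.items():
        profile = aspect_profile(examples)
        best = max(sentences, key=lambda s: score(s, profile), default=None)
        if best is not None and score(best, profile) > 0:
            summary[aspect] = best
        else:
            summary[aspect] = None
    return summary
```

Calling `summarize(articles, ["hh", "hedgehog"], {"GP": [...], "MP": [...]})` with a list of article texts and a few example sentences per aspect returns a dictionary mapping each aspect label to one extracted sentence. The aspect labels reuse the abbreviations listed in the introduction; everything else about the interface is an assumption made for illustration.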