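Shanghai Jiao Tong University Master's Thesis

Automatic Extraction Technology for Sensitive Information in Blog Text Content

Author: 朱文轩 (Zhu Wenxuan)
Degree applied for: Master
Major: Communication and Information Systems
Advisor: 陈丽亚 (Chen Liya)
Date: 2008-01-01

TECHNOLOGY OF SENSITIVE INFORMATION'S AUTOMATIC EXTRACTION IN BLOG TEXTS

ABSTRACT

With the rapid development of information technology and the information industry in recent years, Internet applications have multiplied. Blogs appeared in Western countries in the 1990s and had become fashionable on the Internet by 2001. In 2002 the blog was introduced to China, where within five years it attracted nearly 50 million people: one in four Chinese netizens is a blogger. The blog has become the fourth-largest medium worldwide.

As network information crime has spread, a large amount of research has been devoted to network and system security. The content security of Internet media, however, has received attention only in recent years. On a huge open information source such as the blog, once sensitive information spreads out of control, Internet users will be strongly affected and society will suffer great loss. To protect national stability and to shield network users from harmful messages, necessary measures must be taken to monitor and control this kind of information in blog text. At the same time, Web service organizations should be given techniques and services for controlling access to such information. Research on advanced text information control technology is therefore an urgent and important task.

This thesis draws on natural language understanding, Chinese information processing and related fields, and combines them with our laboratory's research on text information processing. We propose building a decision tree from the attributes of blog text and use it to extract unknown sensitive information from blog text automatically.

The thesis first introduces the development of the blog and presents several examples of sensitive information in blog text in order to analyze the significance of text information filtering. The state of research in China and abroad is also reviewed. The thesis then covers Chinese text preprocessing, representation and classification: automatic Chinese word segmentation, vector representation of text, feature extraction, feature dimension reduction and feature weight calculation. Several classic text classification methods are introduced as well, together with common algorithms for new-word discovery.

Next we introduce methods for extracting web text and its useful attributes, and the technique of using the components of Chinese characters to deal with the character-splitting problem. Because the speed of existing monitoring and control technology raises a new problem, we propose a new technique that builds a decision tree on the attributes of blog text to discover unknown sensitive texts. We explain the concept of the decision tree and several methods for constructing it, adopt the ID3 algorithm, and present several improved versions of ID3.

Finally we show the flowchart of the whole system and explain the work of each part. The system is implemented with the improved ID3 algorithm and compared with existing technology; the results are encouraging. The thesis closes with conclusions on this work and with measures for problems that may arise in later research.

KEYWORDS: Blog, unknown sensitive information, decision tree, ID3 algorithm, Bayesian

Chapter 1

1.1 [...]

Blog is short for Weblog, that is, Web Log [...] HTML [...] Office [...]. In 1997 Jorn Barger, of Robot Wisdom, introduced the term Web-log [1]. Peter Merholz [2] shortened Web-log to Blog in 1999. [...] By November 2007 there were 72.82 million blog spaces and 47 million blog authors in China [...] 2006 [...] 30 million [...], so that about one in four netizens was a blogger [3].

[Figure 1-1. Increase trend of the number of bloggers [3]. Plotted values for 2002 to 2007: 23, 74.6, 183.7, 430.6, 924, 1691.3, with year-on-year growth of 224.3%, 146.2%, 134.4%, 114.6% and 83.0%.]

1.2 [...]

[...] IBM [...] SearchCafe [...] 1998 [...] 80% of the Web [...] [4][5] [...]

1.3 [...]

[...] [6][7] [...] P(R|D) [...] [8][9] [...] [10] [...]

1.4 [...]

[...] Blog [...] ID3 [...]

Chapter 2

2.1 [...]

2.2 [...]

2.2.1 [...]

[...] [11] [...] The segmentation methods and systems named here are MM and RMM (forward and reverse maximum matching), SEG, SEGTAG and ICTCLAS. ICTCLAS reaches a segmentation accuracy of 97.58% (result of the 973 expert group evaluation), a recall above 90% for unknown-word recognition and close to 98% for Chinese personal names, at a processing speed of 31.5 Kbytes/s. [...]

2.2.2 [...]

[...] stop-list [...]

2.2.3 [...]

2.3 [...] [12]

The Vector Space Model (VSM) was proposed in the 1960s by Gerard Salton et al. [13][14][15] and [...] the SMART system [12]. Its basic concepts are:

1. Document: [...]
2. Term: [...] a document is represented by its term list $D(T_1, T_2, \ldots, T_n)$, where $T_k$ is a term and $1 \le k \le n$.
3. Term Weight: in a document with $n$ terms, $D(T_1, T_2, \ldots, T_n)$, each term $T_k$ is given a weight $W_k$, so that $D = D(T_1, W_1; T_2, W_2; \ldots; T_n, W_n)$, abbreviated $D = D(W_1, W_2, \ldots, W_n)$, where $W_k$ is the weight of $T_k$ and $1 \le k \le n$.
4. Similarity: $Sim(D_1, D_2)$ measures the degree of relevance between documents $D_1$ and $D_2$.

In the VSM a document $d$ is represented as $V(d) = (t_1, \omega_1(d); \ldots; t_n, \omega_n(d))$, where $t_i$ $(i = 1, 2, \ldots, n)$ are the terms and $\omega_i(d)$ is the weight of $t_i$ in $d$. The weight is defined as a function of the term frequency $tf_i(d)$ of $t_i$ in $d$: $\omega_i(d) = \psi(tf_i(d))$. A common choice is TF-IDF, $\psi = tf_i(d) \times \log(N/n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents containing $t_i$. [...] The TF-IDF formula is:

$$\omega_i(d) = \frac{tf_i(d) \times \log(N/n_i + 0.1)}{\sqrt{\sum_{i=1}^{n} (tf_i(d))^2 \times \log^2(N/n_i + 0.1)}} \tag{2-1}$$

[...] The similarity of two documents $d_i$ and $d_j$ is the cosine of the angle between their vectors:

$$sim(d_i, d_j) = \cos\theta = \frac{\sum_{k=1}^{n} \omega_k(d_i) \times \omega_k(d_j)}{\sqrt{\left(\sum_{k=1}^{n} \omega_k^2(d_i)\right)\left(\sum_{k=1}^{n} \omega_k^2(d_j)\right)}} \tag{2-2}$$

[...]

2.3.1 [...]

[...] $\chi^2$ [...] K-L [16] [...]

2.3.2 [...]

[...] [17][18] [...] The entropy of a feature $K$ is

$$E(K) = \sum_{i=1}^{m} P_i \log(1/P_i) = -\sum_{i=1}^{m} P_i \log(P_i) \tag{2-3}$$

where $P_i$ [...] $K$ [...] $i$ [...].

Mutual Information between a term $t$ and a class $c$:

$$I(t, c) = \log\frac{p(t \wedge c)}{p(t) \times p(c)} \tag{2-4}$$

where $p(t \wedge c)$ is the probability that $t$ and $c$ occur together, $p(t)$ is the probability of $t$, and $p(c)$ is the probability of $c$.

The $\chi^2$ statistic:

$$\chi^2(t, c) = \frac{N \times (AD - CB)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)} \tag{2-5}$$

where $N$ is the total number of documents, $A$ is the number of documents that contain $t$ and belong to $c$, $B$ the number that contain $t$ but do not belong to $c$, $C$ the number that belong to $c$ but do not contain $t$, and $D$ the number that neither contain $t$ nor belong to $c$.

Cross entropy:

$$CrossEntropyTxt(W) = P(W) \sum_{i} P(C_i|W) \log\frac{P(C_i|W)}{P(C_i)} \tag{2-6}$$

2.3.3 [...]

1. Boolean weighting: the weight takes the value 0 or 1,

$$W_t = \begin{cases} 1 & (TF_t > 0) \\ 0 & (TF_t = 0) \end{cases} \tag{2-7}$$

where $W_t$ is the weight of feature $t$ and $TF_t$ is the frequency of $t$.

2. TF-IDF weighting, built from TF and IDF:
(1) Term Frequency (TF): [...]
(2) Document Frequency (DF): [...]
(3) TF-IDF: [...] the Inverse Document Frequency is $IDF = \log(N/n)$, where $N$ is the total number of documents and $n$ is the number of documents containing the term. [...]

3. Other weightings: MI, IG, $\chi^2$ (CHI), term [...], Entropy weighting.

[...] three variants of TF and TF*IDF: (1) [...]; (2) [TF * ...]; (3) [TF * IDF * ...]. [...]

2.4 [...]

[...] [19] [...]

2.4.1 [...]

[...] Naive Bayes, KNN [...]

1. Naïve Bayes [20]

Let there be $M$ classes $C = \{c_1, \ldots, c_i, \ldots, c_M\}$ with prior probabilities $P(c_i)$, $i = 1, 2, \ldots, M$. [...] For a sample $x$ with class-conditional probability $P(x/c_i)$, the Bayes formula gives the posterior probability $P(c_i/x)$:

$$P(c_i/x) = \frac{P(x/c_i) \cdot P(c_i)}{P(x)} \tag{2-8}$$

The decision rule is: if

$$P(c_i/x) = \max_{j} P(c_j/x), \quad i = 1, 2, \ldots, M; \; j = 1, 2, \ldots, M, \tag{2-9}$$

then $x \in c_i$. Because $P(x)$ is the same for every class, substituting (2-8) into (2-9) gives the equivalent rule: if

$$P(x/c_i) P(c_i) = \max_{j} \left[ P(x/c_j) P(c_j) \right], \quad i = 1, 2, \ldots, M; \; j = 1, 2, \ldots, M,$$

then $x \in c_i$. [...]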
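
The Naive Bayes decision rule of equations (2-8) and (2-9) can be illustrated with a small sketch. It is not the thesis implementation: the multinomial word model with add-one (Laplace) smoothing, the function names `train` and `classify`, and the toy labels are all assumptions made for the example. It picks the class that maximizes P(x/c_i) P(c_i), working in log space to avoid underflow.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs is a list of (tokens, label) pairs (hypothetical toy format)."""
    class_counts = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)  # per-class term frequencies
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    # Prior P(c_i): share of training documents in each class.
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, term_counts, vocab


def classify(tokens, priors, term_counts, vocab):
    """Return the class c_i maximizing log P(c_i) + sum of log P(t/c_i)."""
    best_label, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(term_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Assumption: add-one smoothing so unseen terms get nonzero mass.
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label


training = [
    (["a", "b"], "sensitive"),
    (["a", "a"], "sensitive"),
    (["c", "d"], "normal"),
    (["c", "c"], "normal"),
]
priors, term_counts, vocab = train(training)
print(classify(["a"], priors, term_counts, vocab))
```

P(x) is never computed, because it is the same for every class and does not change which class attains the maximum.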
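
For the vector space model of Section 2.3, the normalized TF-IDF weight of equation (2-1) and the cosine similarity of equation (2-2) can be sketched as follows. This is a minimal illustration under assumptions: documents are already segmented into token lists, vectors are stored as sparse dictionaries, and the helper names `document_frequency`, `tfidf_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def document_frequency(docs):
    """n_i: number of documents that contain each term."""
    return Counter(t for tokens in docs for t in set(tokens))


def tfidf_vector(tokens, doc_freq, n_docs):
    """Weights of equation (2-1): tf * log(N/n_i + 0.1), length-normalized."""
    tf = Counter(tokens)
    raw = {t: f * math.log(n_docs / doc_freq[t] + 0.1) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw


def cosine(u, v):
    """Equation (2-2) for two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


corpus = [
    ["blog", "text", "filter"],
    ["blog", "text", "text"],
    ["decision", "tree"],
]
df = document_frequency(corpus)
vectors = [tfidf_vector(doc, df, len(corpus)) for doc in corpus]
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Terms absent from a document have zero frequency and contribute nothing to the sum in the denominator of (2-1), so normalizing over only the terms that occur gives the same result.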
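
The feature-selection scores of Section 2.3.2 can be computed from the four document counts A, B, C and D. The sketch below covers the chi-square statistic of equation (2-5) and the mutual information of equation (2-4). Estimating the probabilities in (2-4) as relative document frequencies is an assumption of this example, as are the function names and the toy data.

```python
import math


def contingency(docs, term, cls):
    """Count A, B, C, D for a (term, class) pair over (tokens, label) pairs."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term, in_class = term in tokens, label == cls
        if has_term and in_class:
            a += 1  # contains t, belongs to c
        elif has_term:
            b += 1  # contains t, not in c
        elif in_class:
            c += 1  # in c, does not contain t
        else:
            d += 1  # neither
    return a, b, c, d


def chi_square(a, b, c, d):
    """Equation (2-5): N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """Equation (2-4), with p(t^c)=A/N, p(t)=(A+B)/N, p(c)=(A+C)/N assumed."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # t and c never co-occur
    return math.log((a / n) / (((a + b) / n) * ((a + c) / n)))


sample = [
    (["x", "y"], "sensitive"),
    (["x"], "sensitive"),
    (["y"], "normal"),
    (["z"], "normal"),
]
for term in ("x", "y"):
    counts = contingency(sample, term, "sensitive")
    print(term, counts, chi_square(*counts), mutual_information(*counts))
```

In the toy data, a term that appears only in one class receives a high score under both measures, while a term spread evenly across the classes scores zero.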