Statistical and Information-Theoretic Methods for Data Analysis

Department of Computer Science
Series of Publications A
Report A-2007-4

Statistical and Information-Theoretic Methods for Data Analysis

Teemu Roos

To be presented, with the permission of the Faculty of Science of the University of Helsinki, for public criticism in the auditorium of Arppeanum (Helsinki University Museum, Snellmaninkatu 3) on June 9th, at 12 o'clock noon.

University of Helsinki
Finland

Contact information

Postal address:
Department of Computer Science
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FI-00014 University of Helsinki
Finland

Email address: postmaster@cs.Helsinki.FI (Internet)
Telephone: +35891911
Telefax: +358919151120

Copyright © 2007 Teemu Roos
ISSN 1238-8645
ISBN 978-952-10-3988-1 (paperback)
ISBN 978-952-10-3989-8 (PDF)
Computing Reviews (1998) Classification: G.3, H.1.1, I.2.6, I.2.7, I.4, I.5
Helsinki 2007
Helsinki University Printing House

Teemu Roos
Department of Computer Science
P.O. Box 68, FI-00014 University of Helsinki, Finland
teemu.roos@cs.helsinki.fi

PhD Thesis, Series of Publications A, Report A-2007-4
Helsinki, March 2007, 82+75 pages

Abstract

In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data with an application to mobile device positioning. In the second part of the Thesis, we discuss so-called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.

Computing Reviews (1998) Categories and Subject Descriptors:
G.3 Probability and Statistics: correlation and regression analysis, nonparametric statistics
H.1.1 Systems and Information Theory
I.2.6 Learning: concept learning, induction, parameter learning
I.2.7 Natural Language Processing: text analysis
I.4 Image Processing and Computer Vision
I.5 Pattern Recognition

General Terms: data analysis, statistical modeling, machine learning

Additional Key Words and Phrases: information theory, statistical learning theory, Bayesianism, minimum description length principle, Bayesian networks, regression, positioning, stemmatology, denoising

Preface

“We are all shaped by the tools we use, in particular: the formalisms we use shape our thinking habits, for better or for worse [...]”
Edsger W. Dijkstra (1930–2002)

This Thesis is about data analysis: learning and making inferences from data. What do the data have to say? To simplify, this is the question we would ultimately like to answer. Here the data may be whatever observations we make, be it in the form of labeled feature vectors, text, or images; all of these formats are encountered in this work. Here, as usual, the computer scientist's modus operandi is to develop rules and algorithms that can be implemented in a computer. In addition to computer science, there are many other disciplines that are relevant to data analysis, such as statistics, philosophy of science, and various applied sciences, including engineering and bioinformatics. Even these are divided into various sub-fields. For instance, the Bayesian versus non-Bayesian division related to the interpretation of probability exists in many areas.

Diversity characterizes also the present work. The six publications that make the substance of this Thesis contain only one cross-reference between each other (the fifth paper is cited in the sixth one). The advantage of diversity is that with more tools than just a hammer (or a support vector machine), all problems do not have to be nails. Of course, one could not even hope to be comprehensive and all-inclusive. In all of the following, probability plays a central role, often together with its cousin, the code-length. This defines ad hoc the scope and the context of this Thesis. Hence also its title.

In order to cover the necessary preliminaries and background for the actual work, three alternative paradigms for data analysis are encountered before reaching the back cover of this work. The Thesis is divided accordingly into three parts: each part includes a brief introduction to one of the paradigms, followed by contributions in it. These parts are: 1. Statistical Learning Theory; 2. the Bayesian Approach; and 3. Minimum Description Length Principle. The structure of the Thesis is depicted in Figure 1.

[Figure 1: The relationships between the chapters and original publications (Papers 1–6) of the Thesis. Part I, Statistical Learning Theory: Chapter 1 Preliminaries; Chapter 2 Regression Estimation with the EM Algorithm; Chapter 3 Generalization to Unseen Cases. Part II, the Bayesian Approach: Chapter 4 Preliminaries; Chapter 5 Discriminative Bayesian Network Classifiers. Part III, Minimum Description Length Principle: Chapter 6 Preliminaries; Chapter 7 Compression-Based Stemmatic Analysis; Chapter 8 MDL Denoising.]

As this is not a textbook intended to be self-contained, many basic concepts are assumed known. Standard references are, for instance, in probability and statistics [28], in machine learning [26, 83], in Bayesian methods [7], and in information theory [19, 37].

Acknowledgments: I am grateful to my advisors, Professors Petri Myllymäki and Henry Tirri, for their advice, for their efforts in man
