Multimodal learning and Multitask learning
Honglak Lee (University of Michigan)
CVPR 2014 Tutorial: Deep Learning for Computer Vision

Outline
• Multimodal Deep Learning
  – Audio + Video
  – Image + Text
• Deep Transfer / Multitask Learning
  – Generalization of deep learning features over multitasks
  – Disentangling factors of variations

Multimodal Deep Learning
• Motivation: Single deep learning algorithms that combine multiple input domains
  – Images
  – Audio & speech
  – Video
  – Text
  – Robotic sensors
  – Time-series data
  – …
[Figure labels: image, audio, text]

Multimodal Deep Learning
• Advantages
  – Improved recognition performance
  – Robustness to missing values or missing modalities
• Key questions
  – How can we capture associations between heterogeneous modalities by learning mid-level feature representations?

Multimodal deep learning from audio-visual data

Audiovisual speech recognition
• Key problems
  – Can we improve "lip-reading" performance by learning features from video and speech?
  – Does multimodal feature learning improve speech recognition?
  – Can we learn a shared representation that supports robust recognition when some modalities are missing at test time?
• Related work
  – Potamianos et al., Audio-visual automatic speech recognition: An overview, Issues in Visual and Audio-Visual Speech Processing, 2004
  – Matthews et al., Extraction of visual features for lipreading, PAMI 2004
  – Gurban, M. and Thiran, J. P., Information theoretic feature extraction for audio-visual speech recognition, IEEE TSP, 2009

Multimodal Feature Learning
• Lipreading via multimodal feature learning (audio/visual data)
Q. Is concatenating the best option?
Slide credit: J. Ngiam

Multimodal Feature Learning
• Concatenating and learning features (via a single layer of learning) doesn't work
  – Mostly "unimodal" features are learned
Slide credit: J. Ngiam

Multimodal Feature Learning
• Bimodal autoencoder
  – Idea: predict unseen modality from observed modality
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng. Multimodal deep learning. ICML 2011.

Multimodal Feature Learning
• Visualization of learned filters
  – Audio (spectrogram) and video features learned over 100 ms windows
• Results: AVLetters lipreading dataset

  Method                                                   Accuracy
  Zhao et al. (IEEE Multimedia 2009)                       58.9%
  Multimodal deep autoencoder (Ngiam et al., ICML 2011)    65.8%

Speech recognition with noise
• Multimodal feature learning improves speech recognition when the audio input is corrupted with noise
  – Classification on CUAVE dataset (digit recognition)

  Feature Representation        Accuracy (Clean Audio)    Accuracy (Noisy Audio)
  Audio RBM                     95.8%                     75.8% ± 2.0%
  Multimodal DAE                90.0%                     77.3% ± 1.4%
  Multimodal DAE + Audio RBM    94.4%                     82.2% ± 1.2%

(Ngiam et al., ICML 2011)

Robustness to missing modalities
• "Learning to see" experiments
• Performance on CUAVE dataset

  Train/Test     Method                   Accuracy
  Audio-Video    Raw-CCA features         41.9%
  Audio-Video    "deep"-CCA features      57.3%
  Video-Audio    Raw-CCA features         42.9%
  Video-Audio    "deep"-CCA features      91.7%

Multimodal deep learning from image & text

Learning from images and text
• Key problems
  – Improving robustness given images and text
  – Generating text descriptors from images
  – Retrieving images from text queries
• Related work
  – M. Guillaumin, J. Verbeek, and C. Schmid, CVPR 2010
  – M. Huiskes, B. Thomee, and M. Lew, Multimedia Information Retrieval, 2010
  – E. Xing, R. Yan, and A. Hauptmann, UAI 2005
  – G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, T. L. Berg, CVPR 2011
  – …

Training Data
• Samples from the MIR Flickr Dataset
[Figure: sample images with their user tags. Tag sets shown:
  pentax, k10d, kangarooisland southaustralia, sa australia sealion 300mm
  sandbanks, lake, lakeontario, sunset, walking, beach, purple, sky, water, clouds, overtheexcellence
  no text
  camera, jahdakine, lightpainting, reflection doublepaneglass wowiekazowie
  top20butterflies
  mickikrimmel, mickipedia, headshot]
Slide credit: R. Salakhutdinov
N. Srivastava and R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, NIPS 2012

Tasks
• Improve classification
• Fill in missing modalities (image-text)
• Retrieve data from one modality when queried using data from another modality
[Figure: example image tagged "pentax, k10d, kangarooisland southaustralia, sa australia australiansealion 300mm" with classification output SEA / NOT SEA; tag set "beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves" shown for the missing-modality and retrieval tasks]
Slide credit: R. Salakhutdinov

Multimodal Deep Boltzmann Machine
• Joint density model
• Modality-specific lower layers of DBM*
• Fusion at the top (undirected across all layers)
• Inference:
  – Gibbs sampling or mean-field (bottom-up and top-down)
• Learning:
  – Stochastic approximation
Slide credit: R. Salakhutdinov
N. Srivastava and R. Salakhutdinov, NIPS 2012
* R. Salakhutdinov & G. Hinton, Deep Boltzmann machine, AISTATS 2009

Inferring missing modalities
Generating tags given image
Retrieving images from text
[Figures: example results for each task]
Slide credit: R. Salakhutdinov
N. Srivastava and R. Salakhutdinov, NIPS 2012

Classification Performance
• Multimodal inputs

  Model                    Mean AP
  Random                   0.124
  LDA (Huiskes et al.)     0.492
  SVM (Huiskes et al.)     0.475
  DBM                      0.609

• Unimodal (image-only) inputs

  Model                          Mean AP
  Image-SVM (Huiskes et al.)     0.375
  Image-only DBM                 0.469
  DBM (zero-out text inputs)     0.522
  DBM (generate text)            0.531

N. Srivastava and R. Salakhutdinov, NIPS 2012

Deep Visual-Semantic Embedding
• Key idea
  – Two successful representations
    • Images: CNN features
    • Text (words): word embedding (via skip-gram)
  – Associate image features and word embeddings so that we can make inferences about an "unknown" class
• Related work
  – H. Larochelle, D. Erhan, Y. Bengio. Zero-data Learning of New Tasks. AAAI 2008.
  – R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. D. Manning, A. Y. Ng. Zero-Shot Lear
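
The key idea of the Deep Visual-Semantic Embedding slide (associate image features with word embeddings so that an "unknown" class can be inferred) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and is not taken from the slides: the linear projection `WEIGHTS`, the use of cosine similarity, the toy 3-d "CNN feature", the 2-d "word embeddings", and the label names. In practice the projection would be learned from labeled images of seen classes; here it is fixed by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(image_feature, weights):
    """Linearly map an image feature into the word-embedding space."""
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weights]

def predict_label(image_feature, weights, label_embeddings):
    """Return the label whose word embedding is closest to the projected image."""
    projected = project(image_feature, weights)
    return max(label_embeddings,
               key=lambda name: cosine(projected, label_embeddings[name]))

# Hypothetical hand-set projection (2 x 3); a real system would learn it.
WEIGHTS = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 1.0]]

# Hypothetical word embeddings. "zebra" stands in for a class with no
# training images: it is reachable only through its word embedding.
LABEL_EMBEDDINGS = {
    "cat": [1.0, 0.1],
    "dog": [0.8, 0.6],
    "zebra": [0.1, 1.0],
}

image_feature = [0.2, 0.5, 0.6]  # hypothetical CNN feature of a test image
predicted = predict_label(image_feature, WEIGHTS, LABEL_EMBEDDINGS)
print(predicted)
```

Because prediction is a nearest-neighbor lookup in the word-embedding space, adding a new class only requires its word embedding, not new image training data. That is the sense in which such a model can make inferences about unseen classes.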