March 22, 2006 17:17 Proceedings Trim Size: 9in x 6in apolloni-cibb06-final

BICA AND RANDOM SUBSPACE ENSEMBLES FOR DNA MICROARRAY-BASED DIAGNOSIS

B. APOLLONI AND G. VALENTINI
Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano,
Via Comelico 39/41, 20135 Milano, Italy
{apolloni,valentini}@dsi.unimi.it

A. BREGA
Dipartimento di Matematica "F. Enriques", Università degli Studi di Milano,
Via Saldini 50, 20133 Milano, Italy
andrea.brega@unimi.it

We compare two ensemble methods for classifying DNA microarray data. The methods use different strategies to face the curse of dimensionality plaguing these data. One of them projects the data along random coordinates; the other compresses them into independent boolean variables. Both result in random feature extraction procedures that feed SVMs used as base learners of a majority voting ensemble classifier. The classification capabilities of the two methods are comparable, and degrade on instances that are acknowledged as anomalous in the literature.

1. Introduction

The traditional taxonomy of malignancies, based on their morphological, histopathological, and clinical characteristics, may sometimes be ineffective for a correct diagnosis and prognosis of tumors [1]. Indeed, a more refined diagnosis may be achieved by exploiting the genome-wide bio-molecular characteristics of tumors, using high-throughput bio-technologies based on large-scale hybridization techniques (e.g. DNA microarrays) [5].

One of the main drawbacks of DNA microarray data is their very high dimensionality and low cardinality. It is well known that in these cases the curse of dimensionality problem arises. Hence several works have pointed out the importance of feature selection methods for reducing the dimensionality of the input space [7]. An alternative approach is represented by data compression techniques, which can reduce the dimensionality of the data while approximately preserving their information content. As for their processing, several authors have recently proposed applying ensemble methods to improve the performance of state-of-the-art classification algorithms in the context of gene expression data analysis [4].

In this paper we compare two ensemble methods based on data compression techniques for DNA microarray-based diagnosis. The first one exploits random projections to lower dimensional subspaces [8], while the second performs data compression through a Boolean Independent Component Analysis (BICA) algorithm [13]. While the first method has already been applied to gene expression data analysis [3], BICA has never previously been applied to DNA microarray data analysis. In the next two sections we introduce the methods, and in Sect. 4 we experimentally analyze the effectiveness of the two approaches, applying them to the DNA microarray-based diagnosis of tumors.

2. RSE: Random Subspace Ensemble

In the supervised analysis of data, dimensionality reduction is usually pursued through feature selection methods. Many methods can be applied, ranging from filter and wrapper methods to information-theory-based techniques and "embedded" methods (see e.g. [6] for a recent review).

We recently experimented with a different approach [3], based on random subspace ensemble methods [8]. For a fixed $n$, $n$ features (genes) are randomly selected according to the uniform distribution. Then the data of the original $d$-dimensional training set are projected onto the selected $n$-dimensional subspace. The resulting data set is used to train a suitable base learner, and this process is repeated $\nu$ times, giving rise to an ensemble of $\nu$ learning machines trained on different randomly selected subsets of features. The resulting set of classifiers is then combined by majority voting.

This method avoids some of the computational difficulties of feature selection (which is an NP-hard problem), and it lends itself naturally to a parallel implementation. However, feature selection methods can explicitly select sets of relevant features, while this information cannot be directly obtained through RS ensembles. On the other hand, different random projections of the data can improve the diversity between base learners [9], while the overall accuracy of the ensemble can be enhanced through aggregation techniques. As a consequence, the performance of a given classification algorithm may be enhanced.

A high-level pseudo-code of the method is summarized in Fig. 1. In particular, the Subspace_projection procedure selects an $n$-subset $A = \{\alpha_1, \ldots, \alpha_n\}$ from $\{1, 2, \ldots, d\}$, and returns as output the new data set $D_i = \{(P_A(x_j), t_j) \mid 1 \le j \le m\}$, where $P_A(x_1, \ldots, x_d) = (x_{\alpha_1}, \ldots, x_{\alpha_n})$. The new data set $D_i$ is then given as input to a learning algorithm $L$, which outputs a classifier $h_i$. All the classifiers obtained are finally aggregated through majority voting, where card() measures the cardinality of a set.

Random Subspace Ensemble Algorithm
Input:
  - A data set D = {(x_j, t_j) | 1 ≤ j ≤ m}, x_j ∈ X ⊂ R^d, t_j ∈ C = {1, ..., k}
  - a learning algorithm L
  - subspace dimension n < d
  - number of the base learners ν
Output:
  - Final hypothesis h_ran : X → C computed by the ensemble.
begin
  for i = 1 to ν
  begin
    D_i = Subspace_projection(D, n)
    h_i = L(D_i)
  end
  h_ran(x) = arg max_{t ∈ C} card({i | h_i(x) = t})
end.

Figure 1. High-level pseudo-code of the RSE method

3. BICA network

A suitable way of taking decisions based on data is to split the decision process into two steps. The first is devoted to preprocessing the data in a feasible way, so that they can be interpreted in the second. The former step maps real vectors into boolean ones, which should reflect the relevant features of the original data patterns. Stressing the fact that independence is a property of the representation of the data that we use, we searc
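
The RSE procedure of Fig. 1 in Sect. 2 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names subspace_projection, nearest_centroid_learner and train_rse are hypothetical, and a simple nearest-centroid classifier stands in for the SVM base learners used in the paper, so that the sketch needs nothing beyond the standard library. Ties in the vote are broken arbitrarily (in favour of the label voted first).

```python
import random
from collections import Counter


def subspace_projection(data, n, rng):
    """Draw an n-subset A of the d feature indices uniformly at random
    and return it together with the projected data set D_i."""
    d = len(data[0][0])
    idx = sorted(rng.sample(range(d), n))
    projected = [([x[a] for a in idx], t) for x, t in data]
    return idx, projected


def nearest_centroid_learner(data):
    """Stand-in base learner L (assumption: the paper uses SVMs instead).
    Returns a classifier that assigns x to the class with the closest mean."""
    sums, counts = {}, Counter()
    for x, t in data:
        acc = sums.setdefault(t, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[t] += 1
    centroids = {t: [s / counts[t] for s in acc] for t, acc in sums.items()}

    def h(x):
        def sq_dist(t):
            return sum((a - b) ** 2 for a, b in zip(x, centroids[t]))
        return min(centroids, key=sq_dist)

    return h


def train_rse(data, learner, n, nu, seed=0):
    """Train nu base learners on random n-dimensional subspaces and
    return the majority-voting hypothesis h_ran."""
    rng = random.Random(seed)
    members = []
    for _ in range(nu):
        idx, d_i = subspace_projection(data, n, rng)
        # Keep the index set A: a test point must be projected the same way.
        members.append((idx, learner(d_i)))

    def h_ran(x):
        votes = Counter(h([x[a] for a in idx]) for idx, h in members)
        return votes.most_common(1)[0][0]  # arg max over class labels

    return h_ran
```

Each base learner stores its own index set, since a test point has to be projected onto the same coordinates that were used to train that learner. Because the nu training runs are independent of one another, the loop in train_rse is the part that could be parallelized.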