Logistic regression (with R)

Christopher Manning
4 November 2007

1 Theory

We can transform the output of a linear regression to be suitable for probabilities by using a logit link function on the lhs as follows:

$$\operatorname{logit} p = \log o = \log \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \tag{1}$$

The odds can vary on a scale of $(0, \infty)$, so the log odds can vary on the scale of $(-\infty, \infty)$: precisely what we get from the rhs of the linear model. For a real-valued explanatory variable $x_i$, the intuition here is that a unit additive change in the value of the variable should change the odds by a constant multiplicative amount.

Exponentiating, this is equivalent to: [1]

$$e^{\operatorname{logit} p} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k} \tag{2}$$

$$o = \frac{p}{1-p} = e^{\beta_0} e^{\beta_1 x_1} e^{\beta_2 x_2} \cdots e^{\beta_k x_k} \tag{3}$$

The inverse of the logit function is the logistic function. If $\operatorname{logit}(\pi) = z$, then

$$\pi = \frac{e^z}{1 + e^z}$$

The logistic function will map any value of the right hand side ($z$) to a proportion value between 0 and 1, as shown in figure 1.

Note a common case with categorical data: If our explanatory variables $x_i$ are all binary, then for the ones that are false (0), we get $e^0 = 1$ and the term disappears. Similarly, if $x_i = 1$, $e^{\beta_i x_i} = e^{\beta_i}$. So we are left with terms for only the $x_i$ that are true (1). For instance, if $x_3, x_4, x_7 = 1$ only, we have:

$$\operatorname{logit} p = \beta_0 + \beta_3 + \beta_4 + \beta_7 \tag{4}$$

$$o = e^{\beta_0} e^{\beta_3} e^{\beta_4} e^{\beta_7} \tag{5}$$

The intuition here is that if I know that a certain fact is true of a data point, then that will produce a constant change in the odds of the outcome ("If he's European, that doubles the odds that he smokes").

Let $L = L(D; B)$ be the likelihood of the data $D$ given the model, where $B = \{\beta_0, \ldots, \beta_k\}$ are the parameters of the model. The parameters are estimated by the principle of maximum likelihood. Technical point: there is no error term in a logistic regression, unlike in linear regressions.

[1] Note that we can convert freely between a probability $p$ and odds $o$ for an event versus its complement:

$$o = \frac{p}{1-p} \qquad p = \frac{o}{o+1}$$

[Figure 1: The logistic function]

2 Basic R logistic regression models

We will illustrate with the Cedegren dataset on the website.

    > cedegren <- read.table("cedegren.txt", header=T)

You need to create a two-column matrix of success/failure counts for your response variable. You cannot just use percentages. (You can give percentages but then weight them by a count of success + failures.)

    > attach(cedegren)
    > ced.del <- cbind(sDel, sNoDel)

Make the logistic regression model. The shorter second form is equivalent to the first, but don't omit specifying the family.

    > ced.logr <- glm(ced.del ~ cat + follows + factor(class), family=binomial("logit"))
    > ced.logr <- glm(ced.del ~ cat + follows + factor(class), family=binomial)

The output in more and less detail:

    > ced.logr

    Call:  glm(formula = ced.del ~ cat + follows + factor(class), family = binomial("logit"))

    Coefficients:
       (Intercept)            catd            catm            catn            catv        followsP
           -1.3183         -0.1693          0.1786          0.6667         -0.7675          0.9525
          followsV  factor(class)2  factor(class)3  factor(class)4
            0.5341          1.2704          1.0480          1.3742

    Degrees of Freedom: 51 Total (i.e. Null);  42 Residual
    Null Deviance:      958.7
    Residual Deviance: 198.6        AIC: 446.1

    > summary(ced.logr)

    Call:
    glm(formula = ced.del ~ cat + follows + factor(class), family = binomial("logit"))

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -3.24384  -1.34325   0.04954   1.01488   6.40094

    Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
    (Intercept)     -1.31827    0.12221 -10.787  < 2e-16
    catd            -0.16931    0.10032  -1.688 0.091459
    catm             0.17858    0.08952   1.995 0.046053
    catn             0.66672    0.09651   6.908 4.91e-12
    catv            -0.76754    0.21844  -3.514 0.000442
    followsP         0.95255    0.07400  12.872  < 2e-16
    followsV         0.53408    0.05660   9.436  < 2e-16
    factor(class)2   1.27045    0.10320  12.310  < 2e-16
    factor(class)3   1.04805    0.10355  10.122  < 2e-16
    factor(class)4   1.37425    0.10155  13.532  < 2e-16

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 958.66  on 51  degrees of freedom
    Residual deviance: 198.63  on 42  degrees of freedom
    AIC: 446.10

    Number of Fisher Scoring iterations: 4

Residual deviance is the difference in $G^2 = -2 \log L$ between a maximal model that has a separate parameter for each cell in the model and the built model. Changes in the deviance (the difference in the quantity $-2 \log L$) for two models which can be nested in a reduction will be approximately $\chi^2$-distributed with dof equal to the change in the number of estimated parameters. Thus the difference in deviances can be tested against the $\chi^2$ distribution for significance. The same concerns about this approximation being valid only for reasonably sized expected counts (as with contingency tables and multinomials in Suppes (1970)) still apply here, but we (and most people) ignore this caution and use the statistic as a rough indicator when exploring to find good models. We're usually mainly interested in the relative goodness of models, but nevertheless, the high residual deviance shows that the model cannot be accepted to have been likely to generate the data (pchisq(198.63, 42) ≈ 1). However, it certainly fits the data better than the null model (which means that a fixed mean probability of deletion is used for all cells): pchisq(958.66-198.63, 9) ≈ 1.

What can we see from the parameters of this model? catd and catm have different effects, but both are not very clearly significantly different from the effect of cata (the default value). All following environments seem distinctive. For class, all of class 2–4 seem to have somewhat similar effects, and we might model class as a two way distinction. It seems like we cannot profitably drop a whole factor, but we can test that with the anova function to give an analysis of deviance table, or the drop1 function to try dropping each factor:

    > anova(ced.logr, test="Chisq")
    Analysis of Deviance Table

    Model: binomial, link: logit

    Response: ced.del

    Terms added sequentially (first to last)

                  Df Deviance Resid. Df Resid. Dev P(>|Chi|)
    NULL                             51     958.66
    cat            4   314.88        47     643.79 6.690e-67
    follows        2   228.86        45     414.93 2.011e-50
    factor(class)  3   216.30        42     198.63 1.266e-46

    > drop1(ced.logr, test="Chis
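The conversions from section 1 and the deviance tests from section 2 can be checked numerically. The following is a minimal sketch in base R that uses only the coefficients and deviances printed in the output above, so it runs without cedegren.txt. The helper names (logit, logistic, odds_from_prob, prob_from_odds) and the choice of cell (cat n, follows P, class 2) are illustrative assumptions, not part of the original handout.

```r
# Logit and its inverse, the logistic function, written out as in section 1.
logit    <- function(p) log(p / (1 - p))
logistic <- function(z) exp(z) / (1 + exp(z))

# Odds <-> probability conversions from footnote 1.
odds_from_prob <- function(p) p / (1 - p)
prob_from_odds <- function(o) o / (o + 1)

# Coefficients copied from the summary(ced.logr) output above.
b0         <- -1.31827   # (Intercept)
b_catn     <-  0.66672   # catn
b_followsP <-  0.95255   # followsP
b_class2   <-  1.27045   # factor(class)2

# With binary indicators, the log odds are the sum of the active
# coefficients, as in equation (4). The cell chosen here is hypothetical.
z     <- b0 + b_catn + b_followsP + b_class2
p_hat <- logistic(z)          # predicted probability of deletion
o_hat <- odds_from_prob(p_hat)  # equals exp(z), as in equation (5)

# Each coefficient acts multiplicatively on the odds.
odds_multiplier_followsP <- exp(b_followsP)

# pchisq() returns the lower tail by default, which is why the text
# reports values close to 1. The upper tail is the usual p-value.
p_resid <- pchisq(198.63, 42, lower.tail = FALSE)
p_null  <- pchisq(958.66 - 198.63, 9, lower.tail = FALSE)

print(c(z = z, p_hat = p_hat, o_hat = o_hat))
print(c(p_resid = p_resid, p_null = p_null))
```

Base R also provides plogis() and qlogis(), which compute the logistic and logit functions directly; the hand-written versions are kept here only to mirror the formulas in the text.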