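Chapter 7  Cluster Analysis

§1  Hierarchical Clustering (I)

I. Distance coefficients

Cluster analysis is a statistical method for studying how "like gathers with like". Clustering is usually based on either a "distance" or a "similarity coefficient". This section covers the distance-based methods: the shortest distance method (single linkage), the longest distance method (complete linkage), and the median distance method.

Suppose there are n samples and m indicators are measured on each sample. The data matrix is

$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix}$$

The distance between points $X_i$ and $X_j$ can be computed as follows.

(1) Absolute-value distance

$$d_{ij} = \sum_{k=1}^{m} |x_{ik} - x_{jk}|$$

(2) Euclidean distance

$$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$$

(3) Minkowski distance

$$d_{ij}(q) = \Big[\sum_{k=1}^{m} |x_{ik} - x_{jk}|^q\Big]^{1/q}$$

① For q = 1, $d_{ij}(1) = \sum_{k=1}^{m} |x_{ik} - x_{jk}|$, the absolute-value distance.
② For q = 2, $d_{ij}(2) = \big[\sum_{k=1}^{m} |x_{ik} - x_{jk}|^2\big]^{1/2}$, the Euclidean distance.

II. Clustering steps

(1) Start with each of the n samples as its own class.
(2) Compute the distances between samples and merge the two closest samples into one class.
(3) Compute the distances between the new class and the remaining classes, then merge the two closest classes. Repeat until all samples belong to a single class.

III. Clustering methods (shortest distance, longest distance, median distance)

The distance between two classes can be defined in many ways, and each definition yields a different clustering method. Below, $d_{ij}$ denotes the distance between samples $X_i$ and $X_j$, and $D_{ij}$ denotes the distance between classes $G_i$ and $G_j$. We begin with the shortest distance, longest distance, and median distance methods.

1. Shortest distance method

(1) $D_{pq} = \min_{x_i \in G_p,\ x_j \in G_q} d_{ij}$

(2) If $G_r = G_p \cup G_q$, the distance between the new class $G_r$ and any other class $G_k$ ($k \neq p, q$) is

$$D_{kr} = \min\{D_{kp}, D_{kq}\}$$

2. Longest distance method

(1) $D_{pq} = \max_{x_i \in G_p,\ x_j \in G_q} d_{ij}$

(2) $D_{kr} = \max\{D_{kp}, D_{kq}\}$

3. Median distance method

$$D_{kr}^2 = \tfrac{1}{2} D_{kp}^2 + \tfrac{1}{2} D_{kq}^2 - \tfrac{1}{4} D_{pq}^2$$

This has the same form as the formula for the length $m_a$ of the median to side a in a triangle with sides a, b, c:

$$m_a^2 = \tfrac{1}{2} b^2 + \tfrac{1}{2} c^2 - \tfrac{1}{4} a^2$$

IV. Examples

Example 1. Five soybean varieties are measured on one indicator (pods per plant). Cluster them with the shortest distance method. The observations are:

                  x1     x2     x3     x4     x5
Pods per plant    61     58.7   49.2   43.8   39.5

Solution: use $d_{ij} = \sum_{k=1}^{m} |x_{ik} - x_{jk}|$ with m = 1.

Table 1: D(0)

            G1     G2     G3     G4     G5
G1={x1}     0
G2={x2}     2.3    0
G3={x3}     11.8   9.5    0
G4={x4}     17.2   14.9   5.4    0
G5={x5}     21.5   19.2   9.7    4.3    0

Table 2: D(1)

               G6     G3     G4     G5
G6={x1,x2}     0
G3={x3}        9.5    0
G4={x4}        14.9   5.4    0
G5={x5}        19.2   9.7    4.3    0

Table 3: D(2)

               G6     G3     G7
G6={x1,x2}     0
G3={x3}        9.5    0
G7={x4,x5}     14.9   5.4    0

Table 4: D(3)

                  G6     G8
G6={x1,x2}        0
G8={x4,x5,x3}     9.5    0

Clustering table

Number of classes   Classes                        Distance coefficient
4                   {x1,x2}, {x3}, {x4}, {x5}      2.3
3                   {x1,x2}, {x3}, {x4,x5}         4.3
2                   {x1,x2}, {x3,x4,x5}            5.4
1                   {x1,x2,x3,x4,x5}               9.5

Clustering diagram (dendrogram)

Example 2. Seven wheat varieties are measured on three indicators. Cluster them with the shortest distance method. The observations are:

      Yield (kg/mu)   Grains per ear   Thousand-grain weight (g)
x1    297.0           37.5             35.3
x2    312.5           39.5             37.5
x3    279.0           30.5             33.6
x4    332.5           35.2             36.2
x5    352.0           35.8             37.6
x6    382.0           29.3             40.9
x7    374.5           34.6             39.2

Solution:

(1) Data transformation: take common logarithms, lg x.

(2) Compute the distances between varieties with the Euclidean distance

$$d_{ij} = \sqrt{\sum_{k=1}^{3} (x_{ik} - x_{jk})^2}$$

applied to the transformed data:

      Yield     Grains per ear   Thousand-grain weight
x1    2.4728    1.5740           1.5478
x2    2.4949    1.6004           1.5740
x3    2.4456    1.4843           1.5263
x4    2.5218    1.5465           1.5587
x5    2.5465    1.5539           1.5752
x6    2.5820    1.4669           1.6117
x7    2.5735    1.5391           1.5933

(3) Stepwise clustering

Table 1: D(0)

            G1       G2       G3       G4       G5       G6       G7
G1={x1}     0
G2={x2}     0.0433   0
G3={x3}     0.0962   0.1349   0
G4={x4}     0.0573   0.0622   0.1036   0
G5={x5}     0.0812   0.0695   0.1607   0.0306   0
G6={x6}     0.3712   0.1717   0.1614   0.1130   0.1008   0
G7={x7}     0.1159   0.1015   0.1504   0.0626   0.0357   0.0750   0

Table 2: D(1)

               G1       G2       G3       G8       G6       G7
G1={x1}        0
G2={x2}        0.0433   0
G3={x3}        0.0962   0.1349   0
G8={x4,x5}     0.0573   0.0622   0.1036   0
G6={x6}        0.3712   0.1717   0.1614   0.1008   0
G7={x7}        0.1159   0.1015   0.1504   0.0357   0.0750   0

Table 3: D(2)

                  G1       G2       G3       G9       G6
G1                0
G2                0.0433   0
G3                0.0962   0.1349   0
G9={x4,x5,x7}     0.0573   0.0622   0.1036   0
G6                0.3712   0.1717   0.1614   0.0750   0

Table 4: D(3)

                  G10      G3       G9       G6
G10={x1,x2}       0
G3                0.0962   0
G9={x4,x5,x7}     0.0573   0.1036   0
G6                0.1717   0.1614   0.0750   0

Table 5: D(4)

                        G11      G3       G6
G11={x1,x2,x4,x5,x7}    0
G3                      0.0962   0
G6                      0.0750   0.1614   0

Table 6: D(5)

                           G12      G3
G12={x1,x2,x4,x5,x7,x6}    0
G3                         0.0962   0

Number of classes   Classes                                   Distance coefficient
6                   {x1}, {x2}, {x3}, {x4,x5}, {x6}, {x7}     0.0306
5                   {x1}, {x2}, {x3}, {x4,x5,x7}, {x6}        0.0357
4                   {x1,x2}, {x3}, {x4,x5,x7}, {x6}           0.0433
3                   {x3}, {x1,x2,x4,x5,x7}, {x6}              0.0573
2                   {x3}, {x1,x2,x4,x5,x7,x6}                 0.0750
1                   {x1,x2,x3,x4,x5,x6,x7}                    0.0962

(4) Analysis of the classification. Cutting at a distance of 0.045 to 0.05 divides the seven varieties into four classes:

x6            heavy grains, high yield
x4, x5, x7    grain weight and yield medium to high
x1, x2        medium grain weight, medium yield
x3            light grains, low yield

(5) Clustering diagram (dendrogram)

Example 3. Cluster the samples of Example 1 with the median distance method.

                  x1     x2     x3     x4     x5
Pods per plant    61     58.7   49.2   43.8   39.5

Solution:

Table 1: D²(0)

            G1       G2       G3      G4      G5
G1={x1}     0
G2={x2}     5.29     0
G3={x3}     139.24   90.25    0
G4={x4}     295.84   222.01   29.16   0
G5={x5}     462.25   368.64   94.09   18.49   0

In Table 1 the smallest entry is 5.29, so G1 and G2 are merged into G6. The recurrence formula

$$D_{ir}^2 = \tfrac{1}{2} D_{ip}^2 + \tfrac{1}{2} D_{iq}^2 - \tfrac{1}{4} D_{pq}^2$$

then gives $D_{36}^2$, $D_{46}^2$ and $D_{56}^2$. For example,

$$D_{36}^2 = \tfrac{1}{2} D_{31}^2 + \tfrac{1}{2} D_{32}^2 - \tfrac{1}{4} D_{12}^2 = \tfrac{1}{2}(139.24) + \tfrac{1}{2}(90.25) - \tfrac{1}{4}(5.29) = 113.42$$

Table 2: D²(1)

               G6       G3      G4      G5
G6={x1,x2}     0
G3={x3}        113.42   0
G4={x4}        257.60   29.16   0
G5={x5}        414.12   94.09   18.49   0

The smallest entry is now 18.49, so G4 and G5 are merged into G7. The recurrence formula gives $D_{67}^2$ and $D_{37}^2$:

$$D_{67}^2 = \tfrac{1}{2} D_{64}^2 + \tfrac{1}{2} D_{65}^2 - \tfrac{1}{4} D_{45}^2 = 331.24$$

$$D_{37}^2 = \tfrac{1}{2} D_{34}^2 + \tfrac{1}{2} D_{35}^2 - \tfrac{1}{4} D_{45}^2 = 57.00$$

Table 3: D²(2)

               G6       G3      G7
G6={x1,x2}     0
G3={x3}        113.42   0
G7={x4,x5}     331.24   57.00   0

The smallest entry is 57.00, so G3 and G7 are merged into G8, and

$$D_{68}^2 = \tfrac{1}{2} D_{63}^2 + \tfrac{1}{2} D_{67}^2 - \tfrac{1}{4} D_{37}^2 = 208.08$$

Table 4: D²(3)

                  G6       G8
G6={x1,x2}        0
G8={x3,x4,x5}     208.08   0

The clustering result is the same as in Example 1.

§2  Hierarchical Clustering (II)

I. Centroid method

The three methods above ignore the number of samples in each class when defining the distance between classes. To take class size into account, each class can be represented by its centroid, as in physics, and the distance between two classes is then the distance between their centroids. When classifying samples, the centroid of a class is the mean of the samples in that class. Let the centroids of $G_p$ and $G_q$ be $\bar{x}_p$ and $\bar{x}_q$. The distance between $G_p$ and $G_q$ is

$$D_{pq} = d_{\bar{x}_p \bar{x}_q}$$

Clustering with the distance between centroids as the class distance is called the centroid method.

Recomputing centroid distances from this definition every time a new class is formed is tedious. When $d_{ij}$ is the Euclidean distance, a convenient recurrence formula is available. Suppose $G_p$ and $G_q$ are merged into $G_r$, with sample counts $n_p$, $n_q$, $n_r$, where $n_r = n_p + n_q$, and centroids $\bar{x}_p$, $\bar{x}_q$, $\bar{x}_r$ (all m-dimensional vectors). Clearly

$$\bar{x}_r = \frac{1}{n_r}\,(n_p \bar{x}_p + n_q \bar{x}_q)$$

Let $G_k$ be another class with centroid $\bar{x}_k$, and let $D_{kr}$ be the distance between $G_k$ and $G_r$. Then

$$D_{kr}^2 = \frac{n_p}{n_r} D_{kp}^2 + \frac{n_q}{n_r} D_{kq}^2 - \frac{n_p}{n_r}\,\frac{n_q}{n_r}\, D_{pq}^2$$

(proof omitted). This is the recurrence formula of the centroid method, and it makes the class distances easy to compute.

Example 1. Six samples x1, x2, …, x6 are measured on one indicator. Cluster them with the centroid method. The data are:

     x1   x2   x3   x4   x5   x6
A    1    2    5    7    9    10

Solution:

(1) Compute the distances between samples with the Euclidean distance. Let G1={x1}, G2={x2}, …, G6={x6}. With m = 1, $d_{ij} = |x_i - x_j|$, which gives:

Table 1: D²(0)

      G1    G2    G3    G4    G5    G6
G1    0
G2    1     0
G3    16    9     0
G4    36    25    4     0
G5    64    49    16    4     0
G6    81    64    25    9     1     0

(2) Stepwise merging. The class distances are computed with

$$D_{kr}^2 = \frac{n_p}{n_r} D_{kp}^2 + \frac{n_q}{n_r} D_{kq}^2 - \frac{n_p}{n_r}\,\frac{n_q}{n_r}\, D_{pq}^2$$

Since $D_{12}^2 = D_{56}^2 = 1$ is the smallest entry, set G7 = {G1, G2} and compute the distances between the new class G7 and the remaining classes. Here $n_p = n_q = 1$ and $n_r = n_p + n_q = 2$.

Table 2: D²(1)

               G7     G3    G4    G5    G6
G7={x1,x2}     0
G3             12.3   0
G4             30.3   4     0
G5             56.3   16    4     0
G6             72.3   25    9     1     0

For example,

$$D_{37}^2 = \tfrac{1}{2} D_{31}^2 + \tfrac{1}{2} D_{32}^2 - \tfrac{1}{2}\cdot\tfrac{1}{2}\, D_{12}^2 = 12.3$$

Next, G8 = {G5, G6}.

Table 3: D²(2)

      G7     G3     G4    G8
G7    0
G3    12.3   0
G4    30.3   4      0
G8    64.0   20.3   6.3   0

Next, G9 = {G3, G4}.

Table 4: D²(3)

      G7     G9     G8
G7    0
G9    20.3   0
G8    64.0   12.3   0

Next, G10 = {G9, G8} = {x3, x4, x5, x6}.

Table 5: D²(4)

       G7     G10
G7     0
G10    39.1   0

(3) Clustering table

Number of classes   Classes                               D²
5                   {x1,x2}, {x3}, {x4}, {x5}, {x6}       1
4                   {x1,x2}, {x3}, {x4}, {x5,x6}          1
3                   {x1,x2}, {x3,x4}, {x5,x6}             4
2                   {x1,x2}, {x3,x4,x5,x6}                12.3
1                   {x1,x2,x3,x4,x5,x6}                   39.1

(4) Clustering diagram (dendrogram)

II. Sum of squared deviations method (Ward's method)

Ward's method comes from the analysis of variance. If the classes are chosen well, the sum of squared deviations within each class should be small and the sum of squared deviations between classes should be large.

Suppose n samples are divided into k classes G1, …, Gk. Let $x_{it}$ denote the i-th sample in $G_t$ (an m-dimensional vector), $n_t$ the number of samples in $G_t$, and $\bar{x}_t$ the mean of $G_t$. The sum of squared deviations of the samples in $G_t$ is

$$S_t = \sum_{i=1}^{n_t} (x_{it} - \bar{x}_t)'(x_{it} - \bar{x}_t)$$

and the total within-class sum of squared deviations is

$$S = \sum_{t=1}^{k} S_t$$

For fixed k, the goal is the partition that minimizes S. When n and k are large, the number of possible partitions is enormous. For n = 20 and k = 2, for example, $R(20, 2) = 2^{19} - 1 = 524287$, and picking the smallest S out of that many partitions is generally impractical. Ward's method instead finds a locally optimal solution:

(1) Start with each of the n samples as its own class.
(2) Merge the two samples whose merger increases S the least (the number of classes drops to n - 1).
(3) Again merge the two classes whose merger increases S the least, and continue until all samples belong to a single class.

Example 2. Six wheat varieties x1, x2, x3, x4, x5, x6 are observed on one indicator. Cluster them with Ward's method. The data are:

                   x1    x2    x3    x4    x5    x6
Ears per plant     9.2   7.2   4.9   5.0   5.8   7.0

Solution:

(1) Compute the sums of squared deviations. Let G1={x1}, …, G6={x6}. From

$$S_t = \sum_{i=1}^{n_t} (x_{it} - \bar{x}_t)'(x_{it} - \bar{x}_t)$$

we obtain:

Table 1: S(0)

      G1     G2     G3     G4     G5     G6
G1    0
G2    2      0
G3    9.24   2.64   0
G4    8.82   2.42   0.01   0
G5    5.78   0.98   0.40   0.32   0
G6    2.42   0.02   2.20   2.00   0.72   0

For example,

$$S_{12} = \Big(9.2 - \frac{9.2 + 7.2}{2}\Big)^2 + \Big(7.2 - \frac{9.2 + 7.2}{2}\Big)^2 = 2$$

$$S_{13} = \Big(9.2 - \frac{9.2 + 4.9}{2}\Big)^2 + \Big(4.9 - \frac{9.2 + 4.9}{2}\Big)^2 = 9.245 \approx 9.24$$

(2) Stepwise merging. Since $S_{34} = 0.01$ is the smallest entry of S(0), merge G3 and G4 and write G7 = {G3, G4} = {x3, x4}. Next, compute the new class G7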
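The merge loop and the recurrence formulas of §1 and §2 can be sketched in a few lines of Python. This is a minimal illustration, not the text's own procedure written out: the names `agglomerate`, `within_ss`, `soy` and `six` are hypothetical, samples are given as tuples, and ties between equally close pairs are broken by taking the pair with the lowest class numbers (an assumption; the text does not state a tie rule). The shortest and longest distance methods update distances D, while the median and centroid methods update squared Euclidean distances D².

```python
from itertools import combinations


def agglomerate(points, method="single"):
    """Merge classes step by step; return [(members_p, members_q, height), ...].

    'single' and 'complete' work on distances D; 'median' and 'centroid'
    work on squared Euclidean distances D^2, so their heights are D^2 values.
    """
    if method not in ("single", "complete", "median", "centroid"):
        raise ValueError(f"unknown method: {method}")
    squared = method in ("median", "centroid")

    def sample_distance(i, j):
        s = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        return s if squared else s ** 0.5

    clusters = {i: (i,) for i in range(len(points))}  # class id -> sample indices
    D = {frozenset(pair): sample_distance(*pair)
         for pair in combinations(clusters, 2)}
    merges, next_id = [], len(points)

    while len(clusters) > 1:
        # Closest pair of current classes (assumed tie rule: lowest ids first).
        p, q = min(combinations(sorted(clusters), 2),
                   key=lambda pair: D[frozenset(pair)])
        d_pq = D[frozenset((p, q))]
        n_p, n_q = len(clusters[p]), len(clusters[q])
        n_r = n_p + n_q
        for k in clusters:
            if k in (p, q):
                continue
            d_kp, d_kq = D[frozenset((k, p))], D[frozenset((k, q))]
            if method == "single":
                d_kr = min(d_kp, d_kq)
            elif method == "complete":
                d_kr = max(d_kp, d_kq)
            elif method == "median":
                d_kr = 0.5 * d_kp + 0.5 * d_kq - 0.25 * d_pq
            else:  # centroid recurrence, weighted by class sizes
                d_kr = ((n_p / n_r) * d_kp + (n_q / n_r) * d_kq
                        - (n_p * n_q / n_r ** 2) * d_pq)
            D[frozenset((k, next_id))] = d_kr
        merges.append((clusters[p], clusters[q], d_pq))
        clusters[next_id] = clusters.pop(p) + clusters.pop(q)
        next_id += 1
    return merges


def within_ss(samples):
    """Within-class sum of squared deviations S_t of a list of sample tuples."""
    n = len(samples)
    mean = [sum(col) / n for col in zip(*samples)]
    return sum((v - mu) ** 2 for x in samples for v, mu in zip(x, mean))


# Data of §1 Example 1 (soybeans) and §2 Example 1 (six samples).
soy = [(61,), (58.7,), (49.2,), (43.8,), (39.5,)]
six = [(1,), (2,), (5,), (7,), (9,), (10,)]

if __name__ == "__main__":
    for members_p, members_q, height in agglomerate(soy, "single"):
        print(members_p, members_q, round(height, 2))
```

Sample indices start at 0, so `(0, 1)` stands for {x1, x2}. Calling `agglomerate(six, "centroid")` walks through the same sequence of merges as the centroid example, and `within_ss([(9.2,), (7.2,)])` evaluates the quantity S_12 from the Ward example. Keeping the distances in a dictionary keyed by pairs of class ids avoids rebuilding a matrix after every merge, at the cost of leaving stale entries behind.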