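Association Rule Analysis (association analysis)

Supermarket example

Example 3.1 (Groceries.txt). This is a supermarket shopping example (Hahsler et al., 2006). The data contain 9835 transactions involving 169 items. Each transaction is one customer's purchase record, and each item is a binary variable, coded for instance as 1 if purchased and 0 if not. A first pass over the data shows that, among the single-item counts, whole milk has the highest frequency, 2513 (a relative frequency of about 26%), followed by other vegetables (1903), rolls/buns (1809), soda (1715), yogurt (1372), and so on. The frequencies of the items bought by more than 5% of customers are shown in Figure 3.1. We can also find how many customers bought each number of items; the counts for 1 to 9 items are shown in the table below:

[Table: number of customers by number of items purchased, 1 to 9]

[Figure 3.1: Names and relative frequencies of the items bought by more than 5% of customers. Axis: item frequency (relative), 0.00 to 0.25. Items: frankfurter, sausage, pork, beef, citrus fruit, tropical fruit, pip fruit, root vegetables, other vegetables, whole milk, butter, curd, yogurt, whipped/sour cream, domestic eggs, rolls/buns, brown bread, pastry, margarine, coffee, bottled water, soda, fruit/vegetable juice, bottled beer, canned beer, napkins, newspapers, shopping bags]

    library(arules)
    data(Groceries)
    summary(Groceries)
    itemFrequencyPlot(Groceries, support = 0.05, cex.names = 0.8)  # Figure 3.1

Terminology
• Each observation is called a transaction.
• Each binary variable is called an item.
• Transaction data set; item set (itemset).
• Let X denote an item or itemset, and let Y denote another item or itemset disjoint from X. The notation "X ⇒ Y" then denotes a rule in which X and Y occur together.
• In X ⇒ Y, X is called the antecedent (also the condition, the left-hand side or LHS of the rule), and Y is called the consequent (also the result, the right-hand side or RHS of the rule).

Measures
• Support of X ⇒ Y
• Confidence of X ⇒ Y
• Lift of X ⇒ Y

Let s(Z) denote the number of transactions containing the itemset Z in the whole data set of N transactions. Let A be the event that a transaction contains X and B the event that a transaction contains Y (X and Y disjoint). Then:

    support(X ⇒ Y)    = s(X ∪ Y) / N      = P(A ∩ B)
    confidence(X ⇒ Y) = s(X ∪ Y) / s(X)   = P(B | A)
    lift(X ⇒ Y)       = confidence(X ⇒ Y) / P(B) = P(A ∩ B) / (P(A) P(B))

    # Continuing with the Groceries data loaded above
    fsets <- eclat(Groceries, parameter = list(support = 0.05, maxlen = 10))  # find frequent itemsets
    inspect(fsets[1:10])
    inspect(sort(fsets, by = "support")[1:10])
    rules = apriori(Groceries, parameter = list(support = 0.01, confidence = 0.01))  # find rules
    x = subset(rules, subset = rhs %in% "whole milk" & lift > 1.2)
    inspect(sort(x, by = "support")[1:5])      # table in Chapter 3
    inspect(sort(x, by = "confidence")[1:5])   # table in Chapter 3
    # inspect(sort(x, by = "lift")[1:5])

    # Shopping data (shopping.txt)
    library(arules)
    w = read.table("f:/adbook/shopping.txt", header = TRUE, sep = "\t")
    a = w[1:10]
    dim(a)
    # [1] 786  10
    names(a)
    #  [1] "Ready.made"   "Frozen.foods" "Alcohol"    "Fresh.Vegetables" "Milk"
    #  [6] "Bakery.goods" "Fresh.meat"   "Toiletries" "Snacks"           "Tinned.goods"
    a = as.matrix(a)
    trans2 <- as(a, "transactions")
    summary(trans2)  # data overview
    # Plot the data
    itemFrequencyPlot(trans2, support = 0.1, cex.names = 0.8)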
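For intuition about what the plotting call draws: a relative item frequency is the share of transactions that contain the item, which for a 0/1 item matrix is simply the column mean. The sketch below uses base R only and a small made-up matrix `toy` (hypothetical data, not the shopping file); the threshold name `min_support` is likewise just an illustration.

```r
# Hypothetical 0/1 purchase matrix: rows are transactions, columns are items.
toy <- matrix(c(1, 0, 1,
                1, 1, 0,
                0, 0, 1,
                1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("Milk", "Snacks", "Alcohol")))

# Relative item frequency = share of transactions containing the item.
item_freq <- colMeans(toy)

# Keep only the items whose frequency reaches a chosen support threshold.
min_support <- 0.5
frequent_items <- item_freq[item_freq >= min_support]
print(frequent_items)
```

Raising `min_support` shortens the list of items, which is why the plots in these notes show only a handful of the available items.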
[Figure: relative item frequency (axis 0.0 to 0.4) of the shopping-data items with support above 0.1: Ready.made, Frozen.foods, Alcohol, Milk, Bakery.goods, Snacks, Tinned.goods]

    fsets <- eclat(trans2, parameter = list(support = 0.05, maxlen = 10))  # find frequent itemsets
    rules = apriori(trans2, parameter = list(support = 0.01, confidence = 0.6))  # find rules

• Obtain the rules:
  – rules = apriori(trans2, parameter = list(support = 0.01, confidence = 0.6))
• Inspect the rules:
  – inspect(rules[1:3])
• Filter the rules:
  – x = subset(rules, subset = rhs %in% "Milk" & lift > 1.2)
• Sort the rules:
  – inspect(sort(x, by = "confidence")[1:3])

Continuous variables (convert to categorical variables first)
• data(AdultUCI)  # requires library(arules)
• attributes(AdultUCI)$class; attributes(AdultUCI)$names
• dim(AdultUCI); AdultUCI[1:2, ]
• Handling continuous variables:
  – Remove
    • AdultUCI[["fnlwgt"]] <- NULL
    • AdultUCI[["education-num"]] <- NULL
  – Discretize
    • AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)), labels = c("Young", "Middle-aged", "Senior", "Old"))
    • AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]], c(0, 25, 40, 60, 168)), labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
    • AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]], c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)), labels = c("None", "Low", "High"))
    • AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]], c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)), labels = c("none", "low", "high"))
• Adult <- as(AdultUCI, "transactions"); Adult

[Figure: relative item frequency (axis 0.0 to 0.8) of the Adult items with support above 0.5: age=Middle-aged, workclass=Private, race=White, sex=Male, capital-gain=None, capital-loss=none, hours-per-week=Full-time, native-country=United-States, income=small]

    summary(Adult)
    itemFrequencyPlot(Adult, support = 0.5, cex.names = 0.8)
    rules = apriori(Adult, parameter = list(support = 0.01, confidence = 0.6))
    x = subset(rules, subset = rhs %in% "income=large" & lift > 1.2)
    inspect(sort(x, by = "confidence")[1:5])
    inspect(sort(x, by = "lift")[1:5])

Example 3.2 (Adult.txt). An example from the database on the U.S. Census Bureau's government website. The original data have 48842 observations and 15 variables. These 15 variables were selected and converted into 115 binary variables.

    library(arules)
    data(Adult)
    summary(Adult)
    rules <- apriori(Adult, parameter = list(support = 0.01, confidence = 0.6))
    summary(rules)
    rulesIncomeSmall <- subset(rules, subset = rhs %in% "income=small" & lift > 1.2)
    rulesIncomeLarge <- subset(rules, subset = rhs %in% "income=large" & lift > 1.2)
    inspect(sort(rulesIncomeSmall, by = "confidence")[1:3])
    inspect(sort(rulesIncomeLarge, by = "confidence")[1:3])

Shuttle data (data that must be converted to binary variables)

    library(MASS); shuttle[1:10, ]
    summary(shuttle)
    library(arules)
    w <- as(shuttle, "transactions"); summary(w)
    rules <- apriori(w, parameter = list(support = 0.01, confidence = 0.6))
    summary(rules)
    r.useauto <- subset(rules, subset = rhs %in% "use=auto" & lift > 1.2)
    r.usenoauto <- subset(rules, subset = rhs %in% "use=noauto" & lift > 1.2)
    inspect(sort(r.useauto, by = "confidence")[1:3])
    inspect(sort(r.usenoauto, by = "confidence")[1:3])
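The scripts in these notes filter and sort rules by support, confidence and lift. The three measures can be checked by hand on a tiny example. The sketch below uses base R only; the matrix `baskets` and the helper `rule_measures` are hypothetical names introduced for illustration and are not part of arules. It handles only single-item antecedents and consequents.

```r
# Hypothetical 0/1 purchase matrix: rows are transactions, columns are items.
baskets <- matrix(c(1, 1,
                    1, 1,
                    1, 0,
                    0, 1,
                    0, 0),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(NULL, c("bread", "milk")))

# Measures for the rule {lhs} => {rhs}, computed directly from counts.
rule_measures <- function(m, lhs, rhs) {
  n <- nrow(m)
  has_lhs <- m[, lhs] == 1                 # event A: transaction contains lhs
  has_rhs <- m[, rhs] == 1                 # event B: transaction contains rhs
  both <- sum(has_lhs & has_rhs)           # transactions containing both
  support <- both / n                      # estimate of P(A and B)
  confidence <- both / sum(has_lhs)        # estimate of P(B | A)
  lift <- confidence / (sum(has_rhs) / n)  # confidence relative to P(B)
  c(support = support, confidence = confidence, lift = lift)
}

res <- rule_measures(baskets, "bread", "milk")
print(res)
```

Here 2 of 5 baskets contain both items (support 0.4), 2 of the 3 bread baskets also contain milk (confidence 2/3), and milk appears in 3 of 5 baskets, so the lift is (2/3)/(3/5) = 10/9. A lift above 1 indicates that the antecedent makes the consequent more likely than its baseline frequency, which is the idea behind the lift > 1.2 filter.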
Document title: Association Rule Analysis