Naive Bayes in Python

Pros: still effective when training data is scarce; can handle multi-class problems.
Cons: sensitive to how the input data is prepared.
Works with: nominal data.

Bayes' rule:  p(ci | w) = p(w | ci) * p(ci) / p(w)
The classifier assigns a document w to whichever class ci has the larger posterior probability p(ci | w).

Using naive Bayes for document classification

General approach to naive Bayes:
(1) Collect the data: any method will do; this article uses RSS feeds.
(2) Prepare the data: numeric or Boolean values are required.
(3) Analyze the data: with many features, plotting them individually is of little use; histograms work better.
(4) Train the algorithm: compute the conditional probability of each independent feature.
(5) Test the algorithm: compute the error rate.
(6) Use the algorithm: a common application of naive Bayes is document classification, but a naive Bayes classifier can be used in any classification setting, not just text.

Prepare the data: building word vectors from text

Excerpted from Machine Learning in Action:

['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']             0
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid']         1
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him']            0
['stop', 'posting', 'stupid', 'worthless', 'garbage']                  1
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him']      0
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']               1

These are six sentences: a label of 0 marks a normal sentence, and a label of 1 marks an abusive one. By estimating, for each word, the probability that it appears in abusive versus normal sentences, we can work out which words are the abusive ones.

Add the following code to bayes.py:

# coding=utf-8
from numpy import *   # zeros, ones, log, array used below come from NumPy

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]   # 1 = abusive, 0 = normal speech
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # union with the words of each document
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: %s is not in my Vocabulary!' % word)
    return returnVec

Train the algorithm: computing probabilities from the word vectors

# Naive Bayes classifier training function
# trainMatrix: matrix of document word vectors; trainCategory: vector of class labels, one per document
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)   # P(class = 1)
    p0Num = zeros(numWords)
    p1Num = zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom   # P(word_i | class = 1)
    p0Vect = p0Num / p0Denom   # P(word_i | class = 0)
    return p0Vect, p1Vect, pAbusive

Test the algorithm: modifying the classifier for real-world conditions

Make the following changes to the trainNB0() function from the previous section:

p0Num = ones(numWords)
p1Num = ones(numWords)
p0Denom = 2.0
p1Denom = 2.0
p1Vect = log(p1Num / p1Denom)
p0Vect = log(p0Num / p0Denom)

Initializing every word count to one and the denominators to two keeps a single unseen word (probability zero) from wiping out the whole product, and taking logarithms avoids numerical underflow when many small probabilities are multiplied together.
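To see why the logarithm matters, here is a tiny sketch (the per-word probabilities are made up purely for illustration, not taken from the data above): multiplying a couple of hundred small probabilities directly underflows to zero in double precision, while adding their logarithms stays well within range.

# Illustration only: made-up per-word probabilities, not part of bayes.py
from numpy import log, prod

wordProbs = [0.01] * 200        # pretend P(word_i | class) = 0.01 for 200 words

direct = prod(wordProbs)        # 1e-400 underflows to 0.0 in 64-bit floats
logged = sum(log(wordProbs))    # about -921.0, easily representable

print(direct)   # 0.0  -- the raw product can no longer distinguish the classes
print(logged)   # -921.03...  -- log-space scores still compare correctly

This is why classifyNB() below adds log probabilities instead of multiplying raw ones.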
With those changes applied, the full training and classification code becomes:

# Naive Bayes classifier training function
# trainMatrix: matrix of document word vectors; trainCategory: vector of class labels, one per document
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords)    # initialize counts to one
    p1Num = ones(numWords)
    p0Denom = 2.0             # and denominators to two
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)   # log P(word_i | class = 1)
    p0Vect = log(p0Num / p0Denom)   # log P(word_i | class = 0)
    return p0Vect, p1Vect, pAbusive

# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)        # log P(w | c1) + log P(c1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)  # log P(w | c0) + log P(c0)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))

    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))

    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))

    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))

Prepare the data: the bag-of-words document model

Set-of-words model: only the presence or absence of each word is recorded, so a word can appear at most once in the vector.
Bag-of-words model: a word can be counted more than once.

# Naive Bayes bag-of-words model
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

Example: filtering spam email with naive Bayes

(1) Collect the data: text files are provided.
(2) Prepare the data: parse the text files into token vectors.
(3) Analyze the data: inspect the tokens to make sure they were parsed correctly.
(4) Train the algorithm: use the trainNB0() function built earlier.
(5) Test the algorithm: use classifyNB(), and write a new test function that computes the error rate over a set of documents.
(6) Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones to the screen.

Prepare the data: split the text into tokens with a regular expression.

Test the algorithm: cross-validation with naive Bayes.

import random   # used for the random train/test split below

# Takes one long string and parses it into a list of tokens.
# Tokens of two characters or fewer are dropped, and everything is converted to lower case.
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

# Complete spam email test function
def spamTest():
    docList = []
    classList = []
    fullText = []
    # load and parse the text files: 25 spam and 25 ham messages
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)

        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)

    vocabList = createVocabList(docList)
    trainingSet = list(range(50))
    testSet = []
    # randomly hold out 10 documents as the test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])

    trainMat = []
    trainClasses = []
    # train on the documents that remain in the training set
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))

    # classify the held-out documents and count the errors
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', float(errorCount) / len(testSet))
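Because the ten test documents are picked at random, the printed error rate varies from run to run. Assuming the email/spam and email/ham folders from the Machine Learning in Action sample data sit in the working directory next to bayes.py, a minimal driver like the sketch below (the repeat loop is an illustration, not part of the listing above) gives a feel for the average error rate:

# Hypothetical driver script; assumes bayes.py and the email/ folders are in the current directory
from bayes import spamTest

if __name__ == '__main__':
    # The 10-document test set is drawn at random on each call, so repeating
    # the split several times gives a rougher but more stable picture of
    # classifier performance than a single run.
    for _ in range(10):
        spamTest()   # prints "the error rate is: ..." once per run

Averaging the printed rates over the repeated runs is a simple stand-in for full cross-validation.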