Text Classification with sklearn (Feature Extraction, KNN, SVM, Clustering)

Source: 数据挖掘入门与实战 (WeChat public account: datadw)

The workflow is divided into the following steps:

    Load the dataset
    Extract features
    Classification
        Naive Bayes
        KNN
        SVM
    Clustering

The 20newsgroups site ~jason/20Newsgroups/ provides 3 datasets. Here we use the original one, 20news-19997.tar.gz:
~jason/20Newsgroups/20news-19997.tar.gz

1. Load the dataset

Download the dataset from 20news-19997.tar.gz, extract it into the scikit_learn_data folder, and load it. See the code comments for details.

    # first extract the 20 news_group dataset to /scikit_learn_data
    from sklearn.datasets import fetch_20newsgroups

    # all categories
    # newsgroup_train = fetch_20newsgroups(subset='train')

    # part categories
    categories = ['comp.graphics',
                  'comp.os.ms-windows.misc',
                  'comp.sys.ibm.pc.hardware',
                  'comp.sys.mac.hardware',
                  'comp.windows.x']
    newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
    # the test split is used by the feature-extraction code below
    newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

You can check that the data loaded correctly:

    # print category names
    from pprint import pprint
    pprint(list(newsgroup_train.target_names))

Result:

    ['comp.graphics',
     'comp.os.ms-windows.misc',
     'comp.sys.ibm.pc.hardware',
     'comp.sys.mac.hardware',
     'comp.windows.x']

2. Extract features

The newsgroup_train object we just loaded is a collection of documents. We need to extract features from them, such as term frequencies, using fit_transform.

Method 1. HashingVectorizer, with a fixed number of features

    # newsgroup_train.data is the original documents, but we need to extract the
    # feature vectors in order to model the text data
    from sklearn.feature_extraction.text import HashingVectorizer

    # alternate_sign=False replaces the non_negative=True option of older
    # scikit-learn releases
    vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                                   n_features=10000)
    fea_train = vectorizer.fit_transform(newsgroup_train.data)
    # HashingVectorizer is stateless, so transform() is enough for the test set
    fea_test = vectorizer.transform(newsgroups_test.data)

    # return feature vector 'fea_train' [n_samples, n_features]
    print('Size of fea_train: ' + repr(fea_train.shape))
    print('Size of fea_test: ' + repr(fea_test.shape))
    # 11314 documents, 130107 vectors for all categories
    print('The average feature sparsity is {0:.3f}%'.format(
        fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100))

Result:

    Size of fea_train: (2936, 10000)
    Size of fea_test: (1955, 10000)
    The average feature sparsity is 1.002%

Because we keep only 10000 hashed features, about 1% of the matrix entries are non-zero, so the matrix is not yet extremely sparse. In practice, TfidfVectorizer produces tens of thousands of features. Over all samples I counted more than 130,000 dimensions, which gives a very sparse matrix.
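To make the point about the fixed feature count concrete, here is a minimal sketch on a toy corpus instead of 20 Newsgroups. The documents and the value n_features=1024 are assumptions chosen for illustration only. It shows that hashing maps any document set into the same number of columns, and it computes the percentage of non-zero entries the same way as the code above.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents (hypothetical; they stand in for newsgroup_train.data
# and newsgroups_test.data)
train_docs = ["the graphics card renders images",
              "windows crashed after the update"]
test_docs = ["a new mac hardware review"]

# alternate_sign=False keeps every hashed count non-negative
vectorizer = HashingVectorizer(stop_words="english", alternate_sign=False,
                               n_features=1024)

# HashingVectorizer is stateless: no vocabulary is learned, so both
# matrices always end up with exactly n_features columns
fea_train = vectorizer.transform(train_docs)
fea_test = vectorizer.transform(test_docs)

# percentage of non-zero entries in the training matrix
density = fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100

print(fea_train.shape, fea_test.shape)
print("non-zero entries: {0:.3f}%".format(density))
```

Since no vocabulary is stored, the hashed columns cannot be mapped back to terms, which is one reason to look at the vocabulary-based vectorizers as well.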
As the code comment above notes, TF-IDF fitted separately on the train and test sets yields feature matrices with different dimensions. How can we make them match? There are two ways:

Method 2. CountVectorizer + TfidfTransformer

Let two CountVectorizer instances share one vocabulary:

    # ----------------------------------------------------
    # method 1: CountVectorizer + TfidfTransformer
    print('*************************\nCountVectorizer+TfidfTransformer\n*************************')
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    count_v1 = CountVectorizer(stop_words='english', max_df=0.5)
    counts_train = count_v1.fit_transform(newsgroup_train.data)
    print('the shape of train is ' + repr(counts_train.shape))

    # reuse the training vocabulary so the test matrix has the same columns
    count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_)
    counts_test = count_v2.fit_transform(newsgroups_test.data)
    print('the shape of test is ' + repr(counts_test.shape))

    tfidftransformer = TfidfTransformer()
    tfidf_train = tfidftransformer.fit(counts_train).transform(counts_train)
    # apply the IDF weights learned on the training set; refitting on the
    # test set would weight the two matrices inconsistently
    tfidf_test = tfidftransformer.transform(counts_test)

Result:

    *************************
    CountVectorizer+TfidfTransformer
    *************************
    the shape of train is (2936, 66433)
    the shape of test is (1955, 66433)

Method 3. TfidfVectorizer

Let two TfidfVectorizer instances share one vocabulary:

    # method 2: TfidfVectorizer
    print('*************************\nTfidfVectorizer\n*************************')
    from sklearn.feature_extraction.text import TfidfVectorizer

    tv = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    tfidf_train_2 = tv.fit_transform(newsgroup_train.data)

    # tv2 shares the vocabulary but re-estimates IDF on the test set;
    # tv.transform(newsgroups_test.data) would reuse the training IDF instead
    tv2 = TfidfVectorizer(vocabulary=tv.vocabulary_)
    tfidf_test_2 = tv2.fit_transform(newsgroups_test.data)
    print('the shape of train is ' + repr(tfidf_train_2.shape))
    print('the shape of test is ' + repr(tfidf_test_2.shape))

    analyze = tv.build_analyzer()
    # statistical features/terms
    # (get_feature_names() in scikit-learn versions before 1.0)
    tv.get_feature_names_out()

Result:

    *************************
    TfidfVectorizer
    *************************
    the shape of train is (2936, 66433)
    the shape of test is (1955, 66433)

In addition, sklearn ships a ready-made feature-fetching function, fetch_20newsgroups_vectorized.

Method 4. fetch_20newsgroups_vectorized

This method cannot select the features of only a few categories. It can only return the features of all 20 categories:

    print('*************************\nfetch_20newsgroups_vectorized\n*************************')
    from sklearn.datasets import fetch_20newsgroups_vectorized

    tfidf_train_3 = fetch_20newsgroups_vectorized(subset='train')
    tfidf_test_3 = fetch_20newsgroups_vectorized(subset='test')
    print('the shape of train is ' + repr(tfidf_train_3.data.shape))
    print('the shape of test is ' + repr(tfidf_test_3.data.shape))

Result:

    *************************
    fetch_20newsgroups_vectorized
    *************************
    the shape of train is (11314, 130107)
    the shape of test is (7532, 130107)

3. Classification

3.1 Multinomial Naive Bayes Classifier

    ######################################################
    # Multinomial Naive Bayes Classifier
    print('*************************\nNaive Bayes\n*************************')
    from sklearn.naive_bayes import MultinomialNB
    from sklearn import metrics

    newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
    # vectorizer is the stateless HashingVectorizer from Method 1
    fea_test = vectorizer.transform(newsgroups_test.data)

    # create the Multinomial Naive Bayesian Classifier
    clf = MultinomialNB(alpha=0.01)
    clf.
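The flow that sections 2 and 3.1 set up (fit a vectorizer on the training documents, map the test documents into the same feature space, then train MultinomialNB) can be sketched on a toy corpus. This is an illustration under stated assumptions, not the author's code: the documents, the labels, the use of tv.transform for the test set, and the accuracy_score call are all chosen here for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Toy corpus (hypothetical): class 0 = graphics, class 1 = hardware
train_docs = [
    "the graphics card renders 3d images",
    "image rendering and graphics shading",
    "the hard disk and motherboard hardware failed",
    "new hardware: disk controller and motherboard",
]
train_labels = [0, 0, 1, 1]
test_docs = ["graphics rendering of images",
             "motherboard and disk hardware"]
test_labels = [0, 1]

# Fit vocabulary and IDF weights on the training documents only
tv = TfidfVectorizer(sublinear_tf=True, stop_words="english")
tfidf_train = tv.fit_transform(train_docs)

# transform() reuses the fitted vocabulary and IDF, so the test matrix
# has the same columns as the training matrix
tfidf_test = tv.transform(test_docs)

# Same smoothing value as in the article's snippet
clf = MultinomialNB(alpha=0.01)
clf.fit(tfidf_train, train_labels)
pred = clf.predict(tfidf_test)

print("train shape:", tfidf_train.shape, "test shape:", tfidf_test.shape)
print("accuracy:", metrics.accuracy_score(test_labels, pred))
```

Calling transform on the fitted vectorizer is an alternative to the shared-vocabulary approach of Methods 2 and 3. It keeps the column layout identical and also reuses the training IDF weights.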