Introduction
Key features
Installing scikit-learn (Ubuntu 14.04.1)
Classification
1. Supervised learning
    1.1 Generalized linear models
    1.2 Support vector machines
    1.3 Stochastic gradient descent
    1.4 Nearest neighbors
    1.5 Gaussian Processes
    1.6 Cross decomposition
    1.7 Naive Bayes
    1.8 Decision Trees
    1.9 Ensemble methods
    1.10 Multiclass and multilabel algorithms
    1.11 Feature selection
    1.14 Isotonic regression
2.
    2.3 Clustering
    2.5 Decomposing signals in components (matrix factorization problems)
3. Model selection and evaluation
    3.1 Cross-validation: evaluating estimator performance
    3.2 Grid Search: searching for estimator parameters
    3.3 Pipeline: chaining estimators
    3.4 FeatureUnion: combining feature extractors
    3.5 Model evaluation: quantifying the quality of predictions
    3.6 Model persistence
    3.7 Validation curves: plotting scores to evaluate models
4.
    4.2 Preprocessing data
    4.4 Random Projection

Introduction:
scikit-learn is a Python module for machine learning, built on top of SciPy.

Key features:
    Simple and efficient tools for data mining and data analysis
    Accessible to everybody and reusable in any context
    Built on NumPy, SciPy, and matplotlib
    Open source and commercially usable: BSD license

Installing scikit-learn (Ubuntu 14.04.1)
Install the dependencies:
    sudo apt-get install build-essential python-dev python-numpy python-setuptools python-scipy libatlas-dev libatlas3-base python-matplotlib
Install pip:
    sudo apt-get install python-pip
Install scikit-learn:
    sudo pip install -U scikit-learn

Standard library

Classification

1. Supervised learning

1.1 Generalized linear models

1.1.1 Ordinary least squares

Ordinary least squares is an unbiased estimator. The model parameters are obtained by minimizing the least-squares loss function. It is typically used when the observations contain errors, and it solves linear regression problems.

The model is
    $\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p$
and the fit minimizes the sum of squared differences between the observed and the predicted values:
    $\min_w \|Xw - y\|_2^2$

It is implemented by the LinearRegression class in the sklearn.linear_model module.

Constructor of LinearRegression:
    sklearn.linear_model.LinearRegression(
        fit_intercept=True,   # default True: compute the intercept; False: do not compute it
        normalize=False,      # default False; True normalizes the regressors X before regression
        copy_X=True)

Attributes of LinearRegression: coef_ and intercept_. coef_ stores the values $w_1$ to $w_p$ and has the same dimension as the features of X. intercept_ stores the value of $w_0$.

Commonly used methods of LinearRegression:
    decision_function(X)            # returns the predicted values y for X (available in the release these notes target; removed from regressors in later scikit-learn versions)
    fit(X, y[, n_jobs])             # fit the model
    get_params([deep])              # get the constructor parameters of LinearRegression
    predict(X)                      # compute predictions; same as decision_function
    score(X, y[, sample_weight])    # returns R^2, computed with the formula below
    set_params(**params)            # set the constructor parameters of LinearRegression

The score is computed as
    $R^2 = 1 - \frac{\sum (y_{true} - y_{pre})^2}{\sum (y_{true} - y_{truemean})^2}$

Example:
    from sklearn import linear_model

    X = [[0, 0], [1, 1], [2, 2]]
    y = [0, 1, 2]
    clf = linear_model.LinearRegression()
    clf.fit(X, y)
    print(clf.coef_)
    print(clf.intercept_)
    print(clf.predict([[3, 3]]))
    print(clf.decision_function(X))
    print(clf.score(X, y))
    print(clf.get_params())
    print(clf.set_params(fit_intercept=False))

Complexity of ordinary least squares: if X is a matrix with n rows and p columns, the algorithm costs $O(np^2)$, assuming $n \ge p$.

Drawback: the features must be mutually independent. When they are correlated, the estimate becomes highly sensitive to random errors in the observations.

Regression is used to predict numeric values.

1.1.2 Ridge regression

Ridge regression is a biased estimator. Its regression coefficients are more realistic and more reliable, and it fits ill-conditioned data better than ordinary least squares.

Formula:
    $\min_w \|Xw - y\|_2^2 + \alpha \|w\|_2^2$
with $\alpha \ge 0$. The larger $\alpha$ is, the more the coefficients are shrunk and the closer the values of w move toward one another.

Ridge is a modified least-squares method: it adds a penalty equal to the tuning parameter times the sum of the squared coefficients. It is implemented by the Ridge class in the sklearn.linear_model module.

Ridge regression addresses two kinds of problems: the number of samples is smaller than the number of variables, or the variables are collinear.

Constructor of Ridge:
    sklearn.linear_model.Ridge(
        alpha=1.0,            # the alpha in the formula, default 1.0
        fit_intercept=True,
        normalize=False,
        copy_X=True,
        max_iter=None,        # maximum number of iterations for the conjugate gradient solver
        tol=0.001,            # default 0.001
        solver='auto')

Complexity of Ridge regression: the same as ordinary least squares.

Usage:
    from sklearn import linear_model

    X = [[0, 0], [1, 1], [2, 2]]
    y = [0, 1, 2]
    clf = linear_model.Ridge(alpha=0.1)
    clf.fit(X, y)
    print(clf.coef_)
    print(clf.intercept_)
    print(clf.predict([[3, 3]]))
    print(clf.decision_function(X))
    print(clf.score(X, y))
    print(clf.get_params())
    print(clf.set_params(fit_intercept=False))

Setting the tuning parameter ($\alpha$): use generalized cross-validation (RidgeCV) to choose it.

Constructor of RidgeCV:
    sklearn.linear_model.RidgeCV(
        alphas=array([0.1, 1., 10.]),
        fit_intercept=True,
        normalize=False,
        scoring=None,
        cv=None,              # cross-validation generator
        gcv_mode=None,
        store_cv_values=False)

Usage example (the keyword is alphas, not alpha):
    from sklearn import linear_model

    X = [[0, 0], [1, 1], [2, 2]]
    y = [0, 1, 2]
    clf = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
    clf.fit(X, y)
    print(clf.coef_)
    print(clf.intercept_)
    print(clf.predict([[3, 3]]))
    print(clf.decision_function(X))
    print(clf.score(X, y))
    print(clf.get_params())
    print(clf.set_params(fit_intercept=False))

1.1.3 Lasso

Formula:
    $\min_w \frac{1}{2 n_{samples}} \|Xw - y\|_2^2 + \alpha \|w\|_1$

Lasso is a linear model that estimates sparse coefficients. It suits problems in which only a few parameters matter. Because it produces sparse coefficients, it can be used for feature selection.

The implementing class is Lasso, a supervised regression estimator. It handles multicollinearity in regression analysis well.

Idea: minimize the residual sum of squares subject to the constraint that the sum of the absolute values of the regression coefficients is smaller than a constant.

Usage:
    clf = linear_model.Lasso(alpha=0.1)

Setting the tuning parameter ($\alpha$):
    Cross-validation: LassoCV (suited to high-dimensional datasets) or LassoLarsCV (suited to cases where the number of samples is much smaller than the number of features)
    Information criteria for model selection: LassoLarsIC (BIC/AIC)

1.1.4 Elasti
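Looking back at the Lasso tuning options listed in section 1.1.3 (a fixed alpha, LassoCV, and LassoLarsIC), the following is a minimal Python 3 sketch that fits all three on synthetic data. The data set, the variable names, and the choice of cv=5 are assumptions made for illustration and do not come from the notes. It also assumes a recent scikit-learn release rather than the Ubuntu 14.04 packages described above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, LassoLarsIC

# Synthetic data (an assumption for illustration): 20 features,
# of which only the first three actually influence y.
rng = np.random.RandomState(0)
n_samples, n_features = 100, 20
X = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:3] = [2.0, -3.0, 1.5]
y = X @ true_coef + 0.1 * rng.randn(n_samples)

# Fixed tuning parameter, as in the usage line of section 1.1.3.
fixed = Lasso(alpha=0.1).fit(X, y)
n_nonzero_fixed = int(np.sum(fixed.coef_ != 0))

# Tuning parameter chosen by 5-fold cross-validation.
cv_model = LassoCV(cv=5, random_state=0).fit(X, y)

# Tuning parameter chosen by an information criterion (BIC here).
ic_model = LassoLarsIC(criterion="bic").fit(X, y)

print("non-zero coefficients with alpha=0.1:", n_nonzero_fixed)
print("alpha chosen by LassoCV:", cv_model.alpha_)
print("alpha chosen by LassoLarsIC (BIC):", ic_model.alpha_)
```

Counting the non-zero entries of coef_ is one simple way to see the sparsity that makes Lasso usable for feature selection.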
Document title: scikit-learn study notes
Link: https://www.777doc.com/doc-4263391.html