Machine learning in finance using Spark ML pipeline

Who am I?

Outline
- Spark and ML/MLlib background
- Spark ML pipeline
- Hyperparameter tuning
- Spark ML/MLlib feature transformers & algorithms
- Financial use cases
- Credit scoring case

Spark background
- Distributed computing engine
- Apache open source
- Built for speed, ease of use, and sophisticated analytics
- Resilient Distributed Dataset (RDD)
- Expressive APIs in Python, Java, Scala and R

Machine learning in Spark
- Spark is the first general-purpose big data processing engine built for ML from day one
- The initial design of Spark was driven by ML optimization
  - Caching: for running over the same data multiple times
  - Accumulator: to keep state across multiple iterations in memory
  - Good support for CPU-intensive tasks with laziness
  - Aggregate & TreeAggregate
- One of the examples in the first version of Spark was an ML example

[Figure: iterative processing. Labels: Input; iteration 1, iteration 2, iteration 3, ...; one-time processing; distributed memory. Key: keep the working set in RAM.]

Spark for Data Science
DataFrames
- Intuitive manipulation of distributed structured data
- Familiar API based on R & Python Pandas
- Distributed, optimized implementation
Machine Learning Pipelines
- Integration with DataFrames
- Familiar API based on scikit-learn
- Simple parameter tuning

ML workflows are complex
Image classification pipeline
- Specify pipeline
- Inspect & debug
- Re-run on new data
- Tune parameters

ML workflows are complex
[Figure: workflow diagram. Boxes: Data Source 1, Data Source 2, Data Source 3; Extract Features (twice); Feature Transform 1, 2, 3; Model Trainer 1, 2, 3; Evaluate; Ensemble; Best Model.]

Key abstractions of the Spark ML pipeline
- Transformer: feature transformers (e.g., OneHotEncoder) and trained ML models (e.g., LogisticRegressionModel).
- Estimator: ML algorithms for training models (e.g., LogisticRegression).
- Evaluator: these evaluate predictions and compute metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).

Example

Data sources for DataFrames
- LibSVMRelation

    val df = sqlContext.read.format("libsvm").load(path)

Example pipeline
[Figure sequence. Stages on every slide: Load data, Tokenizer, HashingTF, LogisticRegression, evaluate, predict. Slide titles: Load data; Feature transform; Train and evaluate model; Re-run exactly the same way.]
DataFrame columns shown as the data moves through the stages:
- label: Int, text: String
- label: Int, words: Seq[String]
- label: Int, words: Vector
- label: Int, features: Vector, prediction: Int
Train data and test data pass through the same stages.

Concise code

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Split the "text" column into a "words" column.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    // Hash the words into a 1000-dimensional term-frequency vector.
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
    // Fit all stages on the training data, then apply them to the test data.
    val model = pipeline.fit(trainingDataset)
    model.transform(testDataset)

Hyperparameter tuning
[Figure: Dataset; Extract/Transform with #features = 100, 200, 400; Training with regParam = 0.01, 0.1, 1.0; Evaluation.]

Cross validation
Given:
- Estimator
- Parameter grid
- Evaluator
Find best parameters or models

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Build a parameter grid.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 20, 40))
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .build()
    // Set up cross-validation.
    val cv = new CrossValidator()
      .setNumFolds(3)
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new BinaryClassificationEvaluator)
    // Fit a model with cross-validation.
    val cvModel = cv.fit(trainingDataset)

Feature transformers

Transformer          Description                                        scikit-learn
Binarizer            Threshold numerical feature to binary              Binarizer
Bucketizer           Bucket numerical features into ranges
ElementwiseProduct   Scale each feature/column separately
HashingTF            Hash text/data to vector. Scale by term frequency  FeatureHasher
IDF                  Scale features by inverse document frequency       TfidfTransformer
Normalizer           Scale each row to unit norm                        Normalizer
OneHotEncoder        Encode k-category feature as binary features       OneHotEncoder
PolynomialExpansion  Create higher-order features                       PolynomialFeatures
RegexTokenizer       Tokenize text using regular expressions            (part of text methods)
StandardScaler       Scale features to 0 mean and/or unit variance      StandardScaler
StringIndexer        Convert String feature to 0-based indices          LabelEncoder
Tokenizer            Tokenize text on whitespace                        (part of text methods)
VectorAssembler      Concatenate feature vectors                        FeatureUnion
VectorIndexer        Identify categorical features, and index
Word2Vec             Learn vector representation of words

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (HashingTF, OneHotEncoder, Tokenizer,
                                    VectorAssembler, Word2Vec)

    tok = Tokenizer(inputCol="text", outputCol="words")
    htf = HashingTF(inputCol="words", outputCol="tf", numFeatures=200)
    # Word2Vec expects an array-of-strings column, so it reads the tokenized
    # "words" column (the slide text passed the raw "text" column).
    w2v = Word2Vec(inputCol="words", outputCol="w2v")
    ohe = OneHotEncoder(inputCol="userGroup", outputCol="ug")
    va = VectorAssembler(inputCols=["tf", "w2v", "ug"], outputCol="features")
    pipeline = Pipeline(stages=[tok, htf, w2v, ohe, va])

Supervised (discrete): Classification
- LogisticRegression (with Elastic-Net)
- SVM
- DecisionTree
- RandomForest
- GBT
- NaiveBayes
- MultilayerPerceptron
- OneVsRest

Supervised (continuous): Regression
- LinearRegression (with Elastic-Net)
- DecisionTree
- RandomForest
- GBT
- AFTSurvivalRegression
- IsotonicRegression

Unsupervised (discrete): Clustering
- KMeans
- GaussianMixture
- LDA
- PowerIterationClustering

Unsupervised (continuous): Dimen
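
The Transformer / Estimator split listed under the key abstractions can be illustrated without Spark. The sketch below is plain Python, not the Spark API. WhitespaceTokenizer, TermCounter, WordVoteEstimator and WordVoteModel are hypothetical stand-ins, Pipeline and PipelineModel are toy re-implementations that only borrow Spark's names, rows are plain dicts rather than DataFrame rows, and the "classifier" is a simple word-vote rule instead of logistic regression.

```python
from collections import Counter


class WhitespaceTokenizer:
    """Transformer: adds a 'words' field by lowercasing and splitting 'text'."""

    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class TermCounter:
    """Transformer: adds a 'features' field holding term counts of 'words'."""

    def transform(self, rows):
        return [{**r, "features": Counter(r["words"])} for r in rows]


class WordVoteModel:
    """Trained model, which is itself a Transformer: adds a 'prediction' field."""

    def __init__(self, votes):
        self.votes = votes  # word -> Counter mapping label to count

    def transform(self, rows):
        out = []
        for r in rows:
            tally = Counter()
            for word, n in r["features"].items():
                for label, count in self.votes.get(word, {}).items():
                    tally[label] += n * count
            # No known word in the row means no prediction.
            prediction = tally.most_common(1)[0][0] if tally else None
            out.append({**r, "prediction": prediction})
        return out


class WordVoteEstimator:
    """Estimator (toy assumption): fit() returns a WordVoteModel."""

    def fit(self, rows):
        votes = {}
        for r in rows:
            for word, n in r["features"].items():
                votes.setdefault(word, Counter())[r["label"]] += n
        return WordVoteModel(votes)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; Estimators are fitted, Transformers are used as-is."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> trained model
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"label": 1, "text": "spark ml pipeline"},
    {"label": 0, "text": "card fraud alert"},
]
model = Pipeline([WhitespaceTokenizer(), TermCounter(), WordVoteEstimator()]).fit(train)
predictions = model.transform([{"text": "Spark pipeline tuning"}])
```

Because the fitted PipelineModel carries the same stages that were used in training, new data is processed "exactly the same way", which is the point the train/test slides make.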
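
The HashingTF row in the transformer table ("Hash text/data to vector. Scale by term frequency") describes the hashing trick. A minimal sketch of the idea follows, in plain Python. The function name hashing_tf is hypothetical, and zlib.crc32 is swapped in as the hash function purely to keep the example deterministic and dependency-free; it is an assumption for illustration, not the hash Spark uses.

```python
import zlib


def hashing_tf(words, num_features):
    """Map words to a sparse term-frequency vector of size num_features.

    Each word is hashed to a bucket index; the bucket value counts how many
    times a word landed there. Distinct words may collide in one bucket.
    """
    vec = {}
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vec[index] = vec.get(index, 0.0) + 1.0
    return vec


vec = hashing_tf("spark ml spark pipeline".split(), num_features=16)
# The counts always sum to the number of input words (4.0 here),
# regardless of how many collisions occur.
total = sum(vec.values())
```

A smaller num_features saves memory but raises the collision rate, which is why the deck treats the number of features as a hyperparameter to tune.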
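
The cross-validation slide ("Given: Estimator, Parameter grid, Evaluator. Find best parameters or models") can also be sketched in plain Python. Everything here is an illustrative assumption: build_param_grid, k_fold_indices and cross_validate are hypothetical helpers, folds are assigned by row index modulo the fold count rather than by a random split, and the "model" in the usage example is just a threshold with no real training.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def k_fold_indices(n_rows, num_folds):
    """Yield (train_idx, test_idx); fold f tests rows with index % num_folds == f."""
    for fold in range(num_folds):
        test = [i for i in range(n_rows) if i % num_folds == fold]
        train = [i for i in range(n_rows) if i % num_folds != fold]
        yield train, test


def cross_validate(fit, evaluate, rows, param_grid, num_folds=3):
    """Return (best_params, best_score); the metric is averaged over folds,
    and a higher metric is assumed to be better."""
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        scores = []
        for train_idx, test_idx in k_fold_indices(len(rows), num_folds):
            model = fit([rows[i] for i in train_idx], params)
            scores.append(evaluate(model, [rows[i] for i in test_idx]))
        average = sum(scores) / len(scores)
        if average > best_score:
            best_params, best_score = params, average
    return best_params, best_score


# The grid from the slide: 3 x 3 = 9 parameter combinations.
grid = build_param_grid({"numFeatures": [10, 20, 40], "regParam": [0.01, 0.1, 1.0]})


# Toy usage: rows are (x, label) pairs with label = 1 when x >= 5.
def fit_threshold(train_rows, params):
    return params["threshold"]  # nothing is learned in this toy "estimator"


def accuracy(threshold, test_rows):
    return sum(int(x >= threshold) == y for x, y in test_rows) / len(test_rows)


rows = [(x, int(x >= 5)) for x in range(9)]
best_params, best_score = cross_validate(
    fit_threshold, accuracy, rows, build_param_grid({"threshold": [2, 5, 7]})
)
```

The cost grows as (number of grid points) x (number of folds) model fits, so the 9-point grid with 3 folds on the slide implies 27 fits before the final model is chosen.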
Document title: Machine-Learning-Based Prediction and Recommendation for Bank Card Spending Data
Link: https://www.777doc.com/doc-1437360.html