[Classification Algorithms]: Naive Bayes
Published: 2019-06-24


1. Principles and theoretical foundations
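As a brief, standard recap (a textbook summary, not tied to any one library): Naive Bayes applies Bayes' theorem under the "naive" assumption that features are conditionally independent given the class, and predicts the class with the largest posterior probability:

\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)

The lambda = 1.0 argument used throughout the code below is Laplace (add-one) smoothing, which keeps a feature value unseen during training from zeroing out the whole product, and modelType = "multinomial" treats the features as counts.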

2. Spark code examples:

1) Windows local (single machine)

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object local_NaiveBayes {

  // On Windows, point Hadoop at the winutils directory
  System.setProperty("hadoop.home.dir", "E:/zhuangji/winutil/")

  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NaiveBayes")
    val sc = new SparkContext(conf)

    // Load the data; each line is "label,feature1 feature2 feature3"
    val data = sc.textFile("E:/Java_WS/ScalaDemo/data/sample_naive_bayes_data.txt")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }

    // Split the data into training (60%) and test (40%) sets
    val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0)
    val test = splits(1)

    // Train a multinomial Naive Bayes model with Laplace smoothing, then
    // compute accuracy as the fraction of test points predicted correctly
    val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")
    val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
    val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

    // Save the model and load it back
    model.save(sc, "E:/Spark/models/NaiveBayes")
    val sameModel = NaiveBayesModel.load(sc, "E:/Spark/models/NaiveBayes")
  }
}
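For reference, the input file follows the layout of Spark's bundled data/mllib/sample_naive_bayes_data.txt: one example per line, with the label before the comma and space-separated feature values after it. Illustrative lines (the exact values here are an assumption):

0,1 0 0
0,2 0 0
1,0 1 0
2,0 0 1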

2) Cluster mode

The application must be packaged into a jar, then submitted with spark-submit to YARN in client or cluster mode:

spark-submit --class myNaiveBayes --master yarn ScalaDemo.jar
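For example, with sbt the jar can be built before submitting (the project layout and output name here are assumptions, not from the original post):

sbt package   # emits the application jar under target/, which is then passed to spark-submit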

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object myNaiveBayes {
  def main(args: Array[String]) {
    // No setMaster here: the master is supplied by spark-submit
    val conf = new SparkConf().setAppName("NaiveBayes")
    val sc = new SparkContext(conf)

    // Load the data from HDFS; each line is "label,feature1 feature2 feature3"
    val data = sc.textFile("hdfs://nameservice1/user/hive/spark/data/sample_naive_bayes_data.txt")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }

    // Split the data into training (60%) and test (40%) sets
    val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0)
    val test = splits(1)

    // Train a multinomial Naive Bayes model and compute test accuracy
    val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")
    val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
    val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

    // Save the model to HDFS and load it back
    model.save(sc, "hdfs://nameservice1/user/hive/spark/NaiveBayes/model")
    val sameModel = NaiveBayesModel.load(sc, "hdfs://nameservice1/user/hive/spark/NaiveBayes/model")
  }
}

3) PySpark code example

This can be submitted directly with spark-submit, but note that it cannot be deployed in cluster mode (at the time of writing, cluster deploy mode did not support standalone clusters, Mesos clusters, or Python applications):

spark-submit pyNaiveBayes.py
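Equivalently, with the master and deploy mode spelled out (assuming YARN as the target, as in the Scala example above):

spark-submit --master yarn --deploy-mode client pyNaiveBayes.py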

# -*- coding:utf-8 -*-
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonNaiveBayes")

    # Parse "label,feature1 feature2 ..." into a LabeledPoint
    def parseLine(line):
        parts = line.split(',')
        label = float(parts[0])
        features = Vectors.dense([float(x) for x in parts[1].split(' ')])
        return LabeledPoint(label, features)

    data = sc.textFile("hdfs://nameservice1/user/hive/spark/data/sample_naive_bayes_data.txt").map(parseLine)

    # Split into training (60%) and test (40%) sets
    training, test = data.randomSplit([0.6, 0.4], seed=0)

    # Train with Laplace smoothing (lambda = 1.0) and compute test accuracy
    model = NaiveBayes.train(training, 1.0)
    predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
    accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()

    # Save the model and load it back
    model.save(sc, "hdfs://nameservice1/user/hive/spark/PythonNaiveBayes/model")
    sameModel = NaiveBayesModel.load(sc, "hdfs://nameservice1/user/hive/spark/PythonNaiveBayes/model")

3. Python (scikit-learn)

from sklearn import naive_bayes
import random

# Split the raw lines into training and test sets: a line goes to the test
# set when randint(0, M) lands on k, i.e. roughly 1/(M+1) of the lines
# (randint is inclusive at both ends)
def SplitData(data, M, k, seed):
    test = []
    train = []
    random.seed(seed)
    for line in data:
        if random.randint(0, M) == k:
            test.append(''.join(line))
        else:
            train.append(''.join(line))
    return train, test

# Split each line on delimiter1 into label (Y) and feature string,
# then split the features on delimiter2
def parseData(data, delimiter1, delimiter2):
    x = []
    y = []
    for line in data:
        parts = line.split(delimiter1)
        x1 = [float(a) for a in parts[1].split(delimiter2)]
        y1 = float(parts[0])
        x.append(x1)
        y.append(y1)
    return x, y

# Read the data
data = open('e:/java_ws/scalademo/data/sample_naive_bayes_data.txt', 'r')
training, test = SplitData(data, 4, 2, 10)
trainingX, trainingY = parseData(training, ',', ' ')
testX, testY = parseData(test, ',', ' ')

# Fit a Gaussian Naive Bayes model
model = naive_bayes.GaussianNB()
model.fit(trainingX, trainingY)

# Evaluate: predict() expects a 2-D array, hence the wrapping list
for b in testX:
    print(model.predict([b]), b)
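For an aggregate number rather than per-sample output, scikit-learn classifiers also expose score(), which returns the mean accuracy on the given test data:

# Mean accuracy over the test set (score() is part of sklearn's classifier API)
print(model.score(testX, testY))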

Reposted from: https://www.cnblogs.com/skyEva/p/6088653.html
