Unable to write to Elasticsearch from Spark
Problem description:
The Elasticsearch server is version 5.4.1, running on a Linux machine. I cannot write to it from Spark.
The Spark cluster is spark-2.2.0-bin-hadoop2.7. I added spark.jars.packages org.elasticsearch:elasticsearch-spark-20_2.11:5.4.1
to spark-defaults.conf. Starting the master and slave with ./start-master.sh
and ./start-slave.sh spark://ApacheFlink:7077 succeeds, and the Spark web UI is reachable at localhost:8080.
I use IntelliJ IDEA with sbt; the Scala version is 2.11.8.
Here is the Scala code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._

object TestInput {
  def main(args: Array[String]): Unit = {
    println("Hello, world")
    val conf = new SparkConf().setAppName("TestInput").setMaster("spark://ApacheFlink:7077")
    conf.set("es.nodes", "elasticserver")
    conf.set("es.port", "9200")
    conf.set("es.index.auto.create", "true")
    val sc = new SparkContext(conf)
    val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
    val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
    sc.makeRDD(Seq(numbers, airports)).saveToEs("test/TestInput")
  }
}
I have experimented with quite a few sbt dependency combinations. These are my findings so far.
All attempts use scalaVersion := "2.11.8"
First attempt:
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0" % "provided"
[error] /home/foo/Desktop/AnotherOne/src/main/scala/TestInput.scala:4: object elasticsearch is not a member of package org
[error] import org.elasticsearch.spark._
[error] ^
[error] /home/foo/Desktop/AnotherOne/src/main/scala/TestInput.scala:20: value saveToEs is not a member of org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Any]]
[error] sc.makeRDD(Seq(numbers, airports)).saveToEs("test/TestInput")
[error] ^
[error] two errors found
[error] (compile:compileIncremental) Compilation failed
Second attempt:
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0" % "provided"
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.4.1"
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at TestInput$.main(TestInput.scala:11)
at TestInput.main(TestInput.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
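A likely reading of this trace (my interpretation, not stated in the original post): the "provided" scope keeps spark-core, and with it Hadoop classes such as org.apache.hadoop.fs.FSDataInputStream, off the runtime classpath, which is exactly what breaks a launch directly from IntelliJ or sbt. A common sbt idiom is to keep "provided" for cluster deploys but restore the full classpath for the local run task; a sketch, assuming sbt 0.13.x:

```scala
// build.sbt sketch: keep spark-core "provided" for cluster deploys, but put
// it back on the classpath for the local `run` task so the Hadoop classes
// (e.g. FSDataInputStream) can be found when launching from the IDE/sbt
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated
```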
Third attempt:
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.4.1"
17/08/09 13:43:09 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 192.168.1.111, executor 0): java.lang.ClassNotFoundException: org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
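This executor-side ClassNotFoundException suggests the elasticsearch-spark classes were on the driver's classpath (the job compiled and started) but were never shipped to the executors at 192.168.1.111. When submitting from the command line, the usual fix is to name the connector explicitly at submit time; a sketch, with a hypothetical application jar path:

```shell
# Sketch: have spark-submit resolve and ship the ES connector to the executors.
# The application jar path below is hypothetical.
spark-submit \
  --master spark://ApacheFlink:7077 \
  --packages org.elasticsearch:elasticsearch-spark-20_2.11:5.4.1 \
  --class TestInput \
  target/scala-2.11/testinput_2.11-1.0.jar
```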
I even tried importing elasticsearch-hadoop, which produced yet other error messages. My question is now simple: what am I doing wrong? I am out of ideas. What is my Spark cluster missing?
Answer:
You probably have a conflict between the Spark jars pulled in as transitive dependencies of the elasticsearch-spark module and the ones in your build.sbt.
I got it working like this, in my build.sbt:
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.1" % "compile",
  "org.apache.spark" %% "spark-sql" % "2.1.1" % "compile",
  "org.apache.spark" %% "spark-mllib" % "2.1.1" % "compile",
  "org.elasticsearch" %% "elasticsearch-spark-20" % "5.0.2" excludeAll ExclusionRule(organization = "org.apache.spark")
)
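When the driver is launched from inside the IDE rather than via spark-submit, the jars to ship to the standalone cluster's executors can also be listed on the SparkConf itself; a hypothetical sketch (the local jar path below is made up for illustration):

```scala
// Sketch: when running the driver from IntelliJ against spark://ApacheFlink:7077,
// name the jars the executors should fetch (the path below is hypothetical)
val conf = new SparkConf()
  .setAppName("TestInput")
  .setMaster("spark://ApacheFlink:7077")
  .setJars(Seq("/path/to/elasticsearch-spark-20_2.11-5.4.1.jar"))
```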
When I try those dependencies I get this error: 'Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf' – user2811630
If I run the code from the spark shell inside spark-2.2.0-bin-hadoop2.7, everything works fine. The only problem seems to be running it from IntelliJ – user2811630
Have you downloaded Hadoop, or pointed at anything Hadoop-related in any config file? @GPI – user2811630