Subject: Answers to recent questions on Hive on Spark


Hi,

 

Thanks for the heads-up and comments.

 

It sounds like, when it comes to using Spark as the execution engine for Hive, we are in no man's land, so to speak. I have opened questions in both the Hive and Spark user forums, without much luck, for the reasons you alluded to.

 

OK, just to clarify: the prebuilt version of Spark (as opposed to getting the source code and building it yourself) works fine for me.

 

The components are:

 

hadoop version

Hadoop 2.6.0

 

hive --version

Hive 1.2.1

 

Spark

version 1.5.2

 

It does what it says on the tin. For example, I can start the master node fine with start-master.sh:

 

 

Spark Command: /usr/java/latest/bin/java -cp /usr/lib/spark_1.5.2_bin/sbin/../conf/:/usr/lib/spark_1.5.2_bin/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/usr/lib/spark_1.5.2_bin/lib/datanucleus-core-3.2.10.jar:/usr/lib/spark_1.5.2_bin/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark_1.5.2_bin/lib/datanucleus-rdbms-3.2.9.jar:/home/hduser/hadoop-2.6.0/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip 127.0.0.1 --port 7077 --webui-port 8080

========================================

15/11/28 00:05:23 INFO master.Master: Registered signal handlers for [TERM, HUP, INT]

15/11/28 00:05:23 WARN util.Utils: Your hostname, rhes564 resolves to a loopback address: 127.0.0.1; using 50.140.197.217 instead (on interface eth0)

15/11/28 00:05:23 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address

15/11/28 00:05:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

15/11/28 00:05:24 INFO spark.SecurityManager: Changing view acls to: hduser

15/11/28 00:05:24 INFO spark.SecurityManager: Changing modify acls to: hduser

15/11/28 00:05:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser); users with modify permissions: Set(hduser)

15/11/28 00:05:25 INFO slf4j.Slf4jLogger: Slf4jLogger started

15/11/28 00:05:25 INFO Remoting: Starting remoting

15/11/28 00:05:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@127.0.0.1:7077]

15/11/28 00:05:25 INFO util.Utils: Successfully started service 'sparkMaster' on port 7077.

15/11/28 00:05:25 INFO master.Master: Starting Spark master at spark://127.0.0.1:7077

15/11/28 00:05:25 INFO master.Master: Running Spark version 1.5.2

15/11/28 00:05:25 INFO server.Server: jetty-8.y.z-SNAPSHOT

15/11/28 00:05:25 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:8080

15/11/28 00:05:25 INFO util.Utils: Successfully started service 'MasterUI' on port 8080.

15/11/28 00:05:25 INFO ui.MasterWebUI: Started MasterWebUI at http://50.140.197.217:8080

15/11/28 00:05:25 INFO server.Server: jetty-8.y.z-SNAPSHOT

15/11/28 00:05:25 INFO server.AbstractConnector: Started SelectChannelConnector@rhes564:6066

15/11/28 00:05:25 INFO util.Utils: Successfully started service on port 6066.

15/11/28 00:05:25 INFO rest.StandaloneRestServer: Started REST server for submitting applications on port 6066

15/11/28 00:05:25 INFO master.Master: I have been elected leader! New state: ALIVE

 

However, I cannot use Spark in place of the MapReduce engine with this build; it fails.

 

The instructions say to download the Spark source code and build it without the Hive jar files so that Spark can be used as Hive's execution engine.

 

OK.

 

I downloaded the Spark 1.5.2 source code and used the following to create the tarred and gzipped distribution:

 

./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

 

After unpacking the file, I attempted to start the master node as above with start-master.sh. Regrettably, it fails with the following error:

 

 

Spark Command: /usr/java/latest/bin/java -cp /usr/lib/spark_1.5.2_build/sbin/../conf/:/usr/lib/spark_1.5.2_build/lib/spark-assembly-1.5.2-hadoop2.4.0.jar:/home/hduser/hadoop-2.6.0/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip 127.0.0.1 --port 7077 --webui-port 8080

========================================

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger

        at java.lang.Class.getDeclaredMethods0(Native Method)

        at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)

        at java.lang.Class.getMethod0(Class.java:2764)

        at java.lang.Class.getMethod(Class.java:1653)

        at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)

        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger

        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

        at java.security.AccessController.doPrivileged(Native Method)

        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

        ... 6 more

 

 

I believe the problem lies in the spark-assembly-1.5.2-hadoop2.4.0.jar file. Case in point: if I copy spark-assembly-1.5.2-hadoop2.6.0.jar from the prebuilt distribution into the lib directory above, I can start the master node.
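One way to check that belief directly (a sketch, using the jar paths from this mail; adjust to your layout) is to list the contents of the two assemblies and look for the slf4j classes the launcher is complaining about:

```shell
# Count slf4j Logger entries in each assembly. The prebuilt jar should
# show a match; the "hadoop-provided" build is expected to show none,
# since slf4j normally comes in as a transitive Hadoop dependency.
jar tf /usr/lib/spark_1.5.2_bin/lib/spark-assembly-1.5.2-hadoop2.6.0.jar | grep -c 'org/slf4j/Logger'
jar tf /usr/lib/spark_1.5.2_build/lib/spark-assembly-1.5.2-hadoop2.4.0.jar | grep -c 'org/slf4j/Logger'
```

If the second command prints 0, that confirms the NoClassDefFoundError is a classpath gap in the custom build rather than anything wrong with the master itself.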

 

hduser@rhes564::/usr/lib/spark_1.5.2_build/lib> mv spark-assembly-1.5.2-hadoop2.4.0.jar spark-assembly-1.5.2-hadoop2.4.0.jar_old

hduser@rhes564::/usr/lib/spark_1.5.2_build/lib> cp /usr/lib/spark_1.5.2_bin/lib/spark-assembly-1.5.2-hadoop2.6.0.jar .

 

hduser@rhes564::/usr/lib/spark_1.5.2_build/lib> cd ../sbin

hduser@rhes564::/usr/lib/spark_1.5.2_build/sbin> start-master.sh

starting org.apache.spark.deploy.master.Master, logging to /usr/lib/spark_1.5.2_build/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
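Copying the prebuilt assembly works, but it reintroduces the bundled Hive jars the build was trying to exclude. For what it's worth, my understanding (an assumption based on Spark's "Hadoop free build" documentation, which I believe applies to 1.5.x) is that a "hadoop-provided" build expects you to supply Hadoop and its dependencies, slf4j included, via the Hadoop client classpath in conf/spark-env.sh:

```shell
# conf/spark-env.sh  (sketch; assumes the hadoop command is on PATH)
# With a "hadoop-provided" assembly, Hadoop and its transitive
# dependencies (including slf4j) are not bundled, so point Spark's
# launch scripts at the Hadoop client classpath instead:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```

That would let the custom build start without borrowing jars from the prebuilt distribution, though I have not verified it against this exact setup.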

hduser@rhes564::/usr/l