I'm running a standalone Spark cluster on EC2, writing an application that uses the Spark-Cassandra connector driver, and trying to submit the job to the Spark cluster programmatically. The job itself is simple:
public static void main(String[] args) {
    SparkConf conf;
    JavaSparkContext sc;
    conf = new SparkConf()
            .set("spark.cassandra.connection.host", host);
    // Address and port the workers use to reach the driver
    conf.set("spark.driver.host", "[my_public_ip]");
    conf.set("spark.driver.port", "15000");
    sc = new JavaSparkContext("spark://[spark_master_host]", "test", conf);
    // Read one row from Cassandra through the connector
    CassandraJavaRDD<CassandraRow> rdd = javaFunctions(sc).cassandraTable(
            "keyspace", "table");
    System.out.println(rdd.first().toString());
    sc.stop();
}
When I run it on the Spark Master node of the EC2 cluster, it works fine. Now I'm trying to run it from a remote Windows client. The problem comes from these two lines:
conf.set("spark.driver.host", "[my_public_ip]"); conf.set("spark.driver.port", "15000");
First, if I comment out those two lines, the application does not throw an exception, but the executors never run, and I get the following log:
14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/3 is now LOADING
14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/0 is now EXITED (Command exited with code 1)
14/12/06 22:40:03 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141207033931-0021/0 removed: Command exited with code 1
which never ends; and when I check the worker node logs, I find:
14/12/06 22:40:21 ERROR security.UserGroupInformation: PriviledgedActionException as:[username] cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    ... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
    ... 7 more
I have no idea what this means. My guess is that the worker nodes cannot connect back to the driver, which was probably initially set up as:
14/12/06 22:39:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@[some_host_name]:52660]
14/12/06 22:39:30 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@[some_host_name]:52660]
Apparently, no DNS is going to resolve my host name… And since I cannot set the deploy mode to "client" or "cluster" without going through the ./spark-submit script (which I think is ridiculous), I tried adding a host resolution entry "XX.XXX.XXX.XX [host-name]" to /etc/hosts on all the Spark Master and Worker nodes.
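For reference, a minimal way to check whether that entry actually makes the name resolvable from a worker node is a plain JVM lookup. This is only a sketch: the class name is made up for illustration, and "[some_host_name]" stands for whatever hostname the driver advertises in the Remoting log above.

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    public static void main(String[] args) {
        // Placeholder: substitute the hostname the driver advertises,
        // e.g. the [some_host_name] seen in the Remoting log above.
        String driverHost = "[some_host_name]";
        try {
            InetAddress addr = InetAddress.getByName(driverHost);
            System.out.println(driverHost + " resolves to " + addr.getHostAddress());
        } catch (UnknownHostException e) {
            // This is effectively what the executors run into when neither DNS
            // nor /etc/hosts can resolve the driver's hostname.
            System.out.println(driverHost + " does not resolve: " + e.getMessage());
        }
    }
}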
No luck, of course… That brings me to the second case: leaving those two lines uncommented,
which gives me:
14/12/06 22:59:41 INFO Remoting: Starting remoting
14/12/06 22:59:41 ERROR Remoting: Remoting error: [Startup failed] [
akka.remote.RemoteTransportException: Startup failed
    at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
    at akka.remote.Remoting.start(Remoting.scala:194)
    ...
The cause:
Caused by: org.jboss.netty.channel.ChannelException: Failed to bind to: /[my_public_ip]:15000
    at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
    at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:391)
    at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:388)
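A bind failure like this usually just means the JVM cannot find [my_public_ip] on any local network interface (a socket can only be bound to an address that is actually assigned to the machine, which is commonly not the case for a public IP behind a NAT router). A minimal sketch, with a made-up class name, to list what the local JVM could bind to:

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;

public class LocalAddresses {
    public static void main(String[] args) throws SocketException {
        // Lists every address assigned to a local interface.
        // If [my_public_ip] is not in this list, binding to it cannot succeed.
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                System.out.println(nic.getDisplayName() + " -> " + addr.getHostAddress());
            }
        }
    }
}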
I checked my firewall and router settings again to rule the firewall out; netstat -an confirms that port 15000 is not in use (in fact, I tried switching to several other available ports, with no luck); and I can ping the machine's public IP from the cluster machines and from other machines without any problem.
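As another sanity check alongside netstat and ping, the same bind that Akka attempts can be tried outside of Spark. This is only a rough sketch with an illustrative class name; the host and port are the placeholders passed to spark.driver.host and spark.driver.port above.

import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BindCheck {
    public static void main(String[] args) {
        // Placeholders: the same address and port used for spark.driver.host / spark.driver.port.
        String host = "[my_public_ip]";
        int port = 15000;
        try (ServerSocket server = new ServerSocket()) {
            server.bind(new InetSocketAddress(InetAddress.getByName(host), port));
            System.out.println("Bind succeeded on " + host + ":" + port);
        } catch (IOException e) {
            // Same failure mode as the ChannelException above: the address is not
            // bindable on this machine, or the port is blocked or already in use.
            System.out.println("Bind failed: " + e.getMessage());
        }
    }
}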
At this point I'm completely stuck and just trying everything I can to fix this. Any suggestions? Any help is appreciated!
Please check whether your security group has port 15000 open.
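One way to verify that is a simple TCP probe from one of the cluster nodes back to the driver's address. A minimal sketch, with a made-up class name and the host and port as placeholders:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) {
        // Placeholders: the driver's public address and spark.driver.port.
        String host = "[my_public_ip]";
        int port = 15000;
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5000);
            System.out.println("Reachable: " + host + ":" + port);
        } catch (IOException e) {
            // A timeout or refusal here points to the security group,
            // firewall, or router blocking the connection.
            System.out.println("Not reachable: " + e.getMessage());
        }
    }
}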