I am trying to use Spark from Python. The code below is test.py, which I placed in ~/spark/python:
from pyspark import SparkContext, SparkConf
from pyspark.mllib.fpm import FPGrowth

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
data = sc.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)
When I run python test.py, I get this error message:
Exception in thread "main" java.lang.IllegalStateException: Library directory '/home/user/spark/lib_managed/jars' does not exist.
        at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:249)
        at org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:208)
        at org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:119)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:195)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:121)
        at org.apache.spark.launcher.Main.main(Main.java:86)
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    conf = SparkConf().setAppName(appName).setMaster(master)
  File "/home/user/spark/python/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/home/user/spark/python/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/home/user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
    raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
If I move test.py to ~/spark, I get:
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from pyspark import SparkContext, SparkConf
ImportError: No module named pyspark
I cloned the Spark project from the official website. OS: Ubuntu. Java version: 1.7.0_79. Python version: 2.7.11.
Can anyone give me some hints on how to solve this problem?
Spark programs must be submitted with spark-submit. More information: the documentation.
You should try running

$SPARK_HOME/bin/spark-submit test.py

instead of python test.py, if you have not set SPARK_HOME and added its Python lib to PYTHONPATH.
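If you do want plain python test.py to work, exporting those variables first is usually enough. A minimal sketch, assuming Spark lives in ~/spark and that the py4j zip under python/lib matches the file name below (check your own directory for the exact version):

# Assumed install location; adjust to your actual Spark directory
export SPARK_HOME=~/spark
# Make pyspark and the bundled py4j importable from a plain Python interpreter
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
python test.py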
Also, regarding

"I cloned the Spark project from the official website"

this is not recommended, because it can cause a lot of dependency problems. You could instead download a version pre-built for Hadoop and then test it in local mode, following the instructions here.
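For example, after unpacking a pre-built release, a local-mode run would look roughly like this (the release name and path below are only illustrative, not the exact version you need):

# Illustrative: substitute the release you actually downloaded and unpacked
export SPARK_HOME=~/spark-1.6.1-bin-hadoop2.6
# Run the job locally with two worker threads
$SPARK_HOME/bin/spark-submit --master local[2] test.py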