Apache Nutch: FetcherJob throws NoSuchElementException in Gora

I am running Apache Nutch 2.3.1, which uses Gora 0.6.1. I followed the instructions here: http://wiki.apache.org/nutch/RunNutchInEclipse

InjectorJob runs fine.

Now I am running FetcherJob, with Gora using MemStore as the data store. My gora.properties contains

 gora.datastore.default=org.apache.gora.memory.store.MemStore 

This throws:

 2016-10-02 22:55:54,605 ERROR mapreduce.GoraRecordReader (GoraRecordReader.java:nextKeyValue(121)) - Error reading Gora records: null
 2016-10-02 22:55:54,605 INFO mapred.MapTask (MapTask.java:flush(1460)) - Starting flush of map output
 2016-10-02 22:55:54,614 INFO mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
 2016-10-02 22:55:54,615 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(560)) - job_local874667143_0001
 java.lang.Exception: java.lang.RuntimeException: java.util.NoSuchElementException
     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
 Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
     at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:122)
     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
     at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
     at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.java:745)
 Caused by: java.util.NoSuchElementException
     at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
     at org.apache.gora.memory.store.MemStore.execute(MemStore.java:128)
     at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73)
     at org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:67)
     at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:109)
     ... 12 more
 2016-10-02 22:55:55,383 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1360)) - Job job_local874667143_0001 running in uber mode : false
 2016-10-02 22:55:55,385 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1367)) - map 0% reduce 0%
 2016-10-02 22:55:55,387 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Job job_local874667143_0001 failed with state FAILED due to: NA
 2016-10-02 22:55:55,396 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1385)) - Counters: 0
 Exception in thread "main" java.lang.RuntimeException: job failed: name=, jobid=job_local874667143_0001
     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
     at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:205)
     at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:251)
     at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:314)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
     at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:321)

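This happens deep inside Nutch and Gora, and I have no idea why. I tried upgrading to Gora 0.8, with the same result. I tried downgrading Gora to 0.6, with the same result. I could switch to another data store such as HBase, but that is overkill for what I need right now.

Please help me figure this out.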

I can confirm that the problem is in MemStore.

There is a bug in 0.6.1: https://github.com/apache/gora/blob/apache-gora-0.6.1/gora-core/src/main/java/org/apache/gora/memory/store/MemStore.java#L128

It has been fixed on master: https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/memory/store/MemStore.java#L155, where the call to #firstKey() is guarded by an #isEmpty() check.
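The failure mode can be reproduced outside Gora. The following is a minimal sketch, not the actual MemStore code: the class and method names (FirstKeyGuardDemo, unguardedFirstKey, guardedFirstKey) are made up for illustration. It only relies on the documented behavior of ConcurrentSkipListMap.firstKey(), which throws NoSuchElementException on an empty map, and shows how an isEmpty() guard avoids it.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical demo class; it imitates the pattern, not Gora's real code.
public class FirstKeyGuardDemo {

    // Unguarded lookup: throws NoSuchElementException when the map is empty.
    static String unguardedFirstKey(ConcurrentSkipListMap<String, String> map) {
        return map.firstKey();
    }

    // Guarded lookup: returns null instead of throwing when the map is empty.
    static String guardedFirstKey(ConcurrentSkipListMap<String, String> map) {
        return map.isEmpty() ? null : map.firstKey();
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> emptyStore = new ConcurrentSkipListMap<>();
        try {
            unguardedFirstKey(emptyStore);
        } catch (NoSuchElementException e) {
            System.out.println("unguarded: NoSuchElementException");
        }
        System.out.println("guarded: " + guardedFirstKey(emptyStore));
    }
}
```

Since the stack trace ends in firstKey(), the store was presumably empty when the query ran. One plausible reason, which is an assumption and not something verified here, is that MemStore keeps its data only in the memory of the running process, so records written by an earlier InjectorJob run are no longer there when FetcherJob starts in a new JVM.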

However, do not try to update to Gora 0.7-SNAPSHOT, because Nutch is not currently adapted to it.

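Edit

If you want to use Gora 0.7-SNAPSHOT with Nutch 2.x, maybe you can do this:

  1. Download Gora's master branch, which is at 0.7-SNAPSHOT
  2. Run mvn install in gora/ to install it into Maven's local repository
  3. Apply this patch to Nutch: https://paste.apache.org/jjqz so that Nutch 2.3.1 works with Gora 0.7-SNAPSHOT
  4. Do the steps from the Nutch tutorial

I hope it works :)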

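Edit 2

Regarding HBase, setting up a local installation to experiment with is easy.

  1. As described in Nutch2Tutorial, download HBase 0.98.8-hadoop2
  2. Extract the tar.gz file into a directory, for example: /home/you/hbase
  3. cd /home/you/hbase/bin
  4. ./start-hbase.sh

Now you have HBase running. Configure Nutch:

ivy/ivy.xml: see @Emmanuel's comment about the ivy dependency configuration for HBase.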

gora.properties:

 gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
 gora.datastore.autocreateschema=true
 gora.datastore.scanner.caching=100
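In the real file each of these three settings goes on its own line. As a quick sanity check, here is a small sketch that parses the same text with java.util.Properties. It assumes gora.properties follows standard Java properties-file semantics; the class name GoraPropertiesCheck and the parse helper are hypothetical and not part of Gora or Nutch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, only for checking how the text is parsed.
public class GoraPropertiesCheck {

    // Parses properties text using standard java.util.Properties rules.
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        String text = String.join("\n",
            "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore",
            "gora.datastore.autocreateschema=true",
            "gora.datastore.scanner.caching=100");
        Properties props = parse(text);
        // Three separate keys are expected when each setting has its own line.
        System.out.println(props.size());
        System.out.println(props.getProperty("gora.datastore.default"));
    }
}
```

If the settings were pasted onto a single line, the same parser would see only one key, gora.datastore.default, with the rest of the line as its value.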

nutch-site.xml:

 <configuration>
   <property>
     <name>storage.data.store.class</name>
     <value>org.apache.gora.hbase.store.HBaseStore</value>
     <description>Default class for storing data</description>
   </property>
 </configuration>

Done. It will pick up all of HBase's default configuration: localhost, /tmp/…, blablabla