How do I scrape with crawler4j?

I have been at this for four hours now, and I simply cannot see what I'm doing wrong. I have two files:

  1. MyCrawler.java
  2. Controller.java

MyCrawler.java

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * You should implement this function to specify whether the given url
     * should be crawled or not (based on your crawling logic).
     */
    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String domain = page.getWebURL().getDomain();
        String path = page.getWebURL().getPath();
        String subDomain = page.getWebURL().getSubDomain();
        String parentUrl = page.getWebURL().getParentUrl();
        String anchor = page.getWebURL().getAnchor();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Domain: '" + domain + "'");
        System.out.println("Sub-domain: '" + subDomain + "'");
        System.out.println("Path: '" + path + "'");
        System.out.println("Parent page: " + parentUrl);
        System.out.println("Anchor text: " + anchor);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }

        Header[] responseHeaders = page.getFetchResponseHeaders();
        if (responseHeaders != null) {
            System.out.println("Response headers:");
            for (Header header : responseHeaders) {
                System.out.println("\t" + header.getName() + ": " + header.getValue());
            }
        }

        System.out.println("=============");
    }
}

Controller.java

package edu.crawler;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "../data/";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages
         */
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler, numberOfCrawlers);
    }
}

The directory structure is as follows:

    java/MyCrawler.java
    java/Controller.java
    jars/...           --> all the crawler4j jars

I am trying to compile this on a WINDOWS machine using:

 javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" MyCrawler.java 

That works perfectly, and I end up with:

 java/MyCrawler.class 

However, when I type:

 javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" Controller.java 

it blows up with:

    Controller.java:50: error: cannot find symbol
            controller.start(MyCrawler, numberOfCrawlers);
                             ^
      symbol:   variable MyCrawler
      location: class Controller
    1 error

So, I think there is something I'm not doing that I need to be doing. Something that will make this new executable class "aware of" MyCrawler.class. I have tried fiddling with the classpath in the javac part of the command line. I have also tried setting it in my environment variables... no luck.
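For example, one variation I tried looked roughly like this, compiling both files together and listing the class directory itself on the classpath (a guess on my part, since I gather the * wildcard only matches JAR files, not loose .class files):

    javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java" MyCrawler.java Controller.java

Still no luck.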

Any ideas how I can get this working?

UPDATE

I got most of this code from the Google Code page itself, but I just can't figure out what has to go there. Even if I try this:

 MyCrawler mc = new MyCrawler(); 

No luck. Somehow Controller.class does not know about MyCrawler.class.

UPDATE 2

I don't think it matters much, since the problem is clearly that the class cannot be found, but either way, here is the signature of CrawlController's start method. Taken from here.

    /**
     * Start the crawling session and wait for it to finish.
     *
     * @param _c
     *            the class that implements the logic for crawler threads
     * @param numberOfCrawlers
     *            the number of concurrent threads that will be contributing in
     *            this crawling session.
     */
    public <T extends WebCrawler> void start(final Class<T> _c, final int numberOfCrawlers) {
        this.start(_c, numberOfCrawlers, true);
    }

Actually, I am passing in the "crawler" when I pass "MyCrawler". The problem is that the application doesn't know what MyCrawler is.

A few things come to mind:

  1. Does your MyCrawler extend edu.uci.ics.crawler4j.crawler.WebCrawler?

     public class MyCrawler extends WebCrawler 
  2. Are you passing MyCrawler.class (i.e., as a class) to controller.start?

     controller.start(MyCrawler.class, numberOfCrawlers); 

Both of these need to be satisfied in order for the controller to compile and run. Also, Crawler4j has some good examples here:

https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawler.java

https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawlController.java

These two classes will compile and run right away (i.e., BasicCrawlController), so they are a good starting point if you run into any issues.
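For completeness, here is a rough sketch of what the end of your main method should look like once both points above hold (based on your posted Controller, untested):

    /*
     * MyCrawler extends WebCrawler (point 1) and is passed as a Class
     * object, not a bare name or an instance (point 2). start() blocks
     * until the crawl finishes.
     */
    controller.start(MyCrawler.class, numberOfCrawlers);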

Here you are passing a class name, MyCrawler, as a parameter.

 controller.start(MyCrawler, numberOfCrawlers); 

I don't think the class name should be a parameter.
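Presumably it wants the Class object instead, something along the lines of:

    controller.start(MyCrawler.class, numberOfCrawlers);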

I am also working a bit on crawling!