hadoop+nutch+mmseg4j

1. Install Nutch on CentOS:

# svn co http://svn.apache.org/repos/asf/nutch/tags/release-1.6/

After the checkout completes, enter the Nutch root directory:

# cd release-1.6

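Then build:

# ant

# ls

# ls -l (the build creates two directories, build and runtime)

# ls runtime (contains two directories, deploy and local, one for each run mode)

# cd r*

# cd l* (enter local mode)

# ls (bin conf lib plugins test)

# mkdir urls (directory that holds the seed URLs)

# vi urls/url.txt (list the URLs to crawl in this file, one per line)

# bin/nutch (run; with no arguments it prints the usage)

# bin/nutch crawl

# nohup bin/nutch crawl urls -dir data -depth 3 -threads 100 &

# ls (the run creates two new items: logs and nohup.out)

# ls logs (shows hadoop.log)

# cat n*.out (shows the error below)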

Error:

[screenshot: error output]

Fix: set the http.agent.name property in the Nutch configuration:

<property>
  <name>http.agent.name</name>
  <value>nutch</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>
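
Before rebuilding, it can help to confirm that the property really has a non-empty value. The following is a minimal sketch, not part of the original post: the helper name `has_agent_name` is hypothetical, and the file path in the usage line assumes the property was added to conf/nutch-site.xml, which the post does not state.

```shell
# Hypothetical helper: succeed only if the given Nutch config file
# declares http.agent.name with a non-empty <value>.
# Assumes the <value> element appears within two lines after <name>.
has_agent_name() {
  conf_file=$1
  grep -A 2 '<name>http.agent.name</name>' "$conf_file" \
    | grep -q '<value>[^<][^<]*</value>'
}
```

Usage, assuming the file location mentioned above: `has_agent_name conf/nutch-site.xml && echo "http.agent.name is set"`.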

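Run ant again so that the configuration change takes effect:

# cd ../.. (back to the release-1.6 directory)

# ant

Then return to local mode; rerunning reports another error.

Fix:

[screenshot: fix]

Run again against http://blog.tianya.cn:

Change https to http in url.txt.

Crawl again; the result is as follows:

Result of crawling the Tianya blog: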

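Nutch architecture diagram:

Injector: injects the seed URLs

Generator: generates the fetch list

Fetcher: fetches the pages

ParseSegment: parses the pages

CrawlDb: updates the crawl list

These steps make up one Nutch execution cycle. Note that the Injector only needs to inject from urls on the first run; the remaining steps then repeat as a loop.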
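
The cycle can also be driven step by step instead of through the all-in-one crawl command. The sketch below is an illustration under assumptions, not the author's script: the function name `crawl_cycle` and the variables `NUTCH`, `CRAWL_DIR` and `DEPTH` are hypothetical, it is meant to be run from runtime/local, and it assumes that generate signals "nothing to fetch" through a non-zero exit status.

```shell
# Hypothetical settings; override them in the environment if needed.
NUTCH=${NUTCH:-bin/nutch}
CRAWL_DIR=${CRAWL_DIR:-data}
DEPTH=${DEPTH:-3}

crawl_cycle() {
  # Injector: runs once, seeding the CrawlDb from the urls directory.
  "$NUTCH" inject "$CRAWL_DIR/crawldb" urls || return 1
  i=1
  while [ "$i" -le "$DEPTH" ]; do
    # Generator: create a new segment holding the fetch list.
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || break
    # Segment names are timestamps, so the last one listed is the newest.
    segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -n 1)
    # Fetcher, ParseSegment, then the CrawlDb update.
    "$NUTCH" fetch "$segment" -threads 100
    "$NUTCH" parse "$segment"
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment"
    i=$((i + 1))
  done
}
```

Calling `crawl_cycle` after defining it performs one injection followed by up to `DEPTH` generate/fetch/parse/update rounds, which mirrors the loop described above.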

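content: the raw source of the fetched pages

crawl_generate: the generated list of URLs to fetch

crawl_fetch: the fetch status of each URL (fetched successfully or an exception was thrown)

crawl_parse: the parse status of each URL (parsed successfully or failed)

For the parsed content:

parse_text: the text content of the page itself

parse_data: the page metadata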
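
To see which of these parts a given segment actually contains, a small check along the following lines may be useful. It is only a sketch: the helper name `check_segment` is hypothetical, and the directory names are the standard segment subdirectories (content, crawl_generate, crawl_fetch, crawl_parse, parse_text, parse_data).

```shell
# Hypothetical helper: report which of the expected subdirectories
# exist inside one segment directory (e.g. data/segments/<timestamp>).
check_segment() {
  segment=$1
  missing=0
  for part in content crawl_generate crawl_fetch crawl_parse parse_text parse_data; do
    if [ -d "$segment/$part" ]; then
      echo "present: $part"
    else
      echo "missing: $part"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when every part is there.
  [ "$missing" -eq 0 ]
}
```

A segment that was fetched but not yet parsed would, for example, be reported as missing the crawl_parse, parse_text and parse_data parts.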

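1. Generate the fetch list

2. Fetch the pages from the web

3. Parse the fetched pages

4. Write the status of the fetched URLs and the newly discovered URLs back to the CrawlDb

Standalone (local) mode suits development and testing; production uses Hadoop cluster mode.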

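2. Install and configure Solr 4.2

1. Download the installation package

2. Unpack it: tar -xzvf solr-4.2.0.tgz

3. cd solr-4.2.0/example

4. Copy schema-solr4.xml from Nutch's conf directory into solr/collection1/conf, rename it to schema.xml, and overwrite the existing file

5. Edit solr/collection1/conf/schema.xml and add the following under <fields>:

<field name="_version_" type="long" indexed="true" stored="true"/>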
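
The copy step and the check for the `_version_` field can be scripted. This is a minimal sketch with a hypothetical helper name (`install_schema`); the example paths in the comments are assumptions based on the directories used in this post. It deliberately only reports a missing field rather than editing the XML.

```shell
# Hypothetical helper: copy Nutch's Solr 4 schema over Solr's schema.xml
# (keeping a backup of the old file) and report whether the _version_
# field is declared in the result.
install_schema() {
  nutch_conf=$1   # e.g. release-1.6/runtime/local/conf
  solr_conf=$2    # e.g. solr-4.2.0/example/solr/collection1/conf
  if [ -f "$solr_conf/schema.xml" ]; then
    cp "$solr_conf/schema.xml" "$solr_conf/schema.xml.orig"
  fi
  cp "$nutch_conf/schema-solr4.xml" "$solr_conf/schema.xml" || return 1
  if grep -q '<field name="_version_"' "$solr_conf/schema.xml"; then
    echo "_version_ field present"
  else
    echo "_version_ field missing: add it under <fields>"
  fi
}
```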

Solr runs successfully:

[screenshot: Solr running]

3. Configure the mmseg4j tokenizer for Solr 4.2

wget https://mmseg4j.googlecode.com/files/mmseg4j-1.9.1.v20130120-SNAPSHOT.zip

unzip mmseg4j-1.9.1.v20130120-SNAPSHOT.zip -d mmseg4j-1.9.1

Copy mmseg4j-1.9.1/dist/*.jar into Solr's lib directory.
In schema.xml, replace
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  and
  <tokenizer class="solr.StandardTokenizerFactory"/>
  with
  <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex"/>
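
The tokenizer replacement can be done with sed rather than by hand. The sketch below is one possible way, with a hypothetical helper name (`use_mmseg4j`); it assumes each tokenizer element is written on a single line and is self-closing, and it keeps a backup copy of the schema.

```shell
# Hypothetical helper: swap the Whitespace and Standard tokenizers in a
# schema.xml for the mmseg4j tokenizer in "complex" mode.
# The original file is kept as <schema>.bak.
use_mmseg4j() {
  schema=$1
  mmseg='<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex"/>'
  cp "$schema" "$schema.bak" || return 1
  sed -e "s|<tokenizer class=\"solr\.WhitespaceTokenizerFactory\"[^>]*/>|$mmseg|" \
      -e "s|<tokenizer class=\"solr\.StandardTokenizerFactory\"[^>]*/>|$mmseg|" \
      "$schema.bak" > "$schema"
}
```

Usage, assuming the Solr layout from the previous section: `use_mmseg4j solr/collection1/conf/schema.xml`, followed by a Solr restart so the new analyzer chain is loaded.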

Run Solr and submit the index

Start the Solr server:
java -jar start.jar &

Web interface:

http://host2:8983

Submit the index in local mode:

bin/nutch solrindex http://192.168.3.200:8983/solr data/crawldb -linkdb data/linkdb -dir data/segments
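
After indexing, a match-all query is a simple way to check that documents arrived. The sketch below only builds the query URL; the helper name `solr_select_url` is hypothetical, and it assumes that Solr's default core answers /select directly under the base URL (with a named core the path would include the core name, for example /solr/collection1/select).

```shell
# Hypothetical helper: build a Solr select URL that asks only for the
# number of matching documents (rows=0) in JSON format.
solr_select_url() {
  base=$1    # e.g. http://192.168.3.200:8983/solr
  query=$2   # e.g. *:*
  echo "$base/select?q=$query&rows=0&wt=json"
}
```

Usage: `curl -s "$(solr_select_url http://192.168.3.200:8983/solr '*:*')"`; the `numFound` value in the response is the number of indexed documents.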
