Building a Simple Search Engine with Nutch + Tomcat on Ubuntu

Date: 2022-03-15 08:53:45

Setting up a simple search engine

My setup:

Nutch: 1.2

Tomcat: 7.0.57

1 Nutch setup

Edit the Nutch configuration.

1.1 Edit conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!--property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>xxx0624-ThinkPad-Edge</value>
</property-->
<property>
<name>http.agent.name</name>
<value>nutch1.</value>
</property>
<property>
<name>plugin.folders</name>
<value>./plugins</value>
</property>
</configuration>

1.2 Edit conf/crawl-urlfilter.txt

 # accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*sohu.com/

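Find this rule in the file and edit it. I use sohu.com as the example: the rule accepts only URLs whose host is sohu.com or one of its subdomains.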
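
To see which URLs the rule lets through, here is a minimal sketch that runs the same regular expression with java.util.regex. It assumes the filter strips the leading + (the accept marker) and applies the rest as an ordinary Java regex; the class UrlFilterDemo and the helper accepts are hypothetical names for illustration, not part of Nutch.

```java
import java.util.regex.Pattern;

class UrlFilterDemo {
    // The rule from crawl-urlfilter.txt without its leading '+'.
    static final Pattern SOHU = Pattern.compile("^http://([a-z0-9]*\\.)*sohu.com/");

    // Hypothetical helper: true if the URL matches the accept rule.
    static boolean accepts(String url) {
        return SOHU.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.sohu.com/",
            "http://news.sohu.com/a/1.html",
            "http://sohu.com/",
            "https://www.sohu.com/",
            "http://www.example.com/"
        };
        for (String u : urls) {
            System.out.println(accepts(u) + "  " + u);
        }
    }
}
```

Two details worth knowing: the rule only matches the http scheme, so https URLs are not accepted, and the dot in sohu.com is unescaped, so it matches any character in that position. Writing sohu\.com in the rule would make it strict.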

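1.3 Add the seed folder

In the Nutch directory, use mkdir to create a new folder named urls, then create an empty text file named urls.txt inside it.

Write the addresses of the pages to crawl into urls.txt, for example http://www.sohu.com/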

1.4 Start crawling

Command:

bin/nutch crawl urls/urls.txt -dir crawled -depth 5 -threads 5 -topN 200

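crawled is the directory where the crawl results are stored. -depth 5 follows links up to five levels from the seed page, -threads 5 sets the number of fetcher threads, and -topN 200 caps the number of pages fetched at each level at 200. When the crawl finishes, five folders have been generated automatically: crawldb, index, indexes, linkdb, segments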
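
Before moving on to Tomcat, it can help to confirm that the crawl really produced those five folders. The following is a small sketch of such a check; CrawlDirCheck and missing are hypothetical names, and the default path "crawled" simply mirrors the -dir argument used above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CrawlDirCheck {
    // The five folders the crawl command is expected to leave behind.
    static final List<String> EXPECTED =
        Arrays.asList("crawldb", "index", "indexes", "linkdb", "segments");

    // Returns the expected folder names that are not present under crawlDir.
    static List<String> missing(File crawlDir) {
        List<String> absent = new ArrayList<String>();
        for (String name : EXPECTED) {
            if (!new File(crawlDir, name).isDirectory()) {
                absent.add(name);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "crawled");
        List<String> absent = missing(dir);
        if (absent.isEmpty()) {
            System.out.println("crawl output looks complete: " + dir);
        } else {
            System.out.println("missing under " + dir + ": " + absent);
        }
    }
}
```

If anything is reported missing, the crawl most likely stopped early, and the search page will have nothing to read.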

2 Tomcat setup

2.1 Copy the war file built by Nutch into Tomcat's webapps directory and start Tomcat. Then, in the nutch-1.2 folder that Tomcat unpacks, edit WEB-INF/classes/nutch-site.xml

<property>
<name>searcher.dir</name>
<value>/home/xxx0624/nutch-1.2/crawled</value>
</property>

This sets the location of the crawled data, so it must point to the directory passed to -dir during the crawl.

2.2 Fixing garbled Chinese characters

2.2.1 Edit Tomcat's configuration file conf/server.xml

 <Connector port="" protocol="HTTP/1.1"
connectionTimeout=""
redirectPort=""
URIEncoding="UTF-8"
useBodyEncodingForURI="true"/>

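Add the URIEncoding and useBodyEncodingForURI attributes to the Connector and leave its existing port, connectionTimeout, and redirectPort values unchanged.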
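
Why this matters: Tomcat 7 decodes request URIs as ISO-8859-1 by default, while browsers send Chinese query terms as percent-encoded UTF-8. The sketch below only illustrates the mismatch with java.net.URLDecoder; it is not how Tomcat decodes internally, and UriEncodingDemo and decode are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UriEncodingDemo {
    // Hypothetical helper: percent-decodes a query value using the named charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 percent-encoding of the two characters U+641C U+72D0.
        String encoded = "%E6%90%9C%E7%8B%90";
        String asUtf8 = decode(encoded, "UTF-8");
        String asLatin1 = decode(encoded, "ISO-8859-1");
        System.out.println("UTF-8 decode length: " + asUtf8.length());
        System.out.println("ISO-8859-1 decode length: " + asLatin1.length());
        System.out.println("round trip intact: " + asUtf8.equals("\u641c\u72d0"));
    }
}
```

Decoded as UTF-8, the six bytes become the original two characters. Decoded as ISO-8859-1, each byte becomes its own character, which is the garbled text the search page shows without the attribute.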

2.2.2 Edit nutch-1.2/cache.jsp

Find this section:

Metadata metaData = bean.getParseData(details).getContentMeta();
// Read the metadata through ParseData.getMeta() instead of metaData.get()
ParseData parseData = bean.getParseData(details);
String content = null;
// String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
String contentType = parseData.getMeta(Metadata.CONTENT_TYPE);
if (contentType.startsWith("text/html")) {
  // FIXME : it's better to emit the original 'byte' sequence
  // with 'charset' set to the value of 'CharEncoding',
  // but I don't know how to emit 'byte sequence' in JSP.
  // out.getOutputStream().write(bean.getContent(details)) may work,
  // but I'm not sure.
  //String encoding = (String) metaData.get("CharEncodingForConversion");
  String encoding = parseData.getMeta("CharEncodingForConversion");
  if (encoding != null) {
    try {
      content = new String(bean.getContent(details), encoding);
    }
    catch (UnsupportedEncodingException e) {
      // fallback to windows-1252
      content = new String(bean.getContent(details), "windows-1252");
    }
  }
  else
    // No encoding recorded for the page: decode as GBK instead of the platform default
    content = new String(bean.getContent(details), "GBK");
    //content = new String(bean.getContent(details));
}
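
The fallback order in that snippet can be tried outside the JSP. This is a standalone sketch under a few assumptions: it swaps new String(bytes, name) for the java.nio.charset.Charset API so no checked exception is needed, it assumes the JVM ships the GBK charset, and CacheDecodeDemo and decode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

class CacheDecodeDemo {
    // Same order as the JSP: declared encoding first, windows-1252 if that
    // name is not usable, GBK when no encoding was recorded at all.
    static String decode(byte[] raw, String declared) {
        Charset cs;
        if (declared == null) {
            cs = Charset.forName("GBK");
        } else {
            try {
                cs = Charset.forName(declared);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                cs = Charset.forName("windows-1252");
            }
        }
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        String text = "\u641c\u72d0";
        byte[] gbkBytes = text.getBytes(Charset.forName("GBK"));
        System.out.println("no encoding recorded, GBK fallback intact: "
            + decode(gbkBytes, null).equals(text));
        System.out.println("unknown encoding, windows-1252 fallback intact: "
            + decode(gbkBytes, "no-such-charset").equals(text));
    }
}
```

The GBK default is a choice that suits Chinese sites such as the one crawled here; for pages in other encodings without a declared charset it would garble the text just as the platform default did.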

3 Try it out

Restart Tomcat.

Open http://localhost:8080/nutch-1.2 in a browser.