Nutch Secondary Development Summary (Part 2)

Date: 2022-09-05 13:15:46

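3 Search result optimization
When searching with the package bundled with nutch 0.9, the results contain redundant data. For example, when searching for information about 姚明 (Yao Ming) or 易建联 (Yi Jianlian), nutch by default also returns pages where these names appear only in the navigation bar or in a few headings. Taking Tencent as an example, the navigation bar of http://sports.qq.com/nba/ contains both names, but nothing else on the page is about them, so this page is probably not what we want.
Another case: when searching for “莎娃”, nutch returns http://sports.qq.com/a/20090108/000407.htm, but on that page “莎娃” appears only in the hyperlink column on the right-hand side. This page is probably not what we want either.
nutch uses HTMLParser to parse each crawled page into plain text, saves it to the local disk, and then builds the index with lucene. To remove the useless information, we have to start from the part marked in red in the figure, i.e. the step that turns HTML into text.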

Optimization method 1:
Rewrite org.apache.nutch.parse.html.DOMContentUtils and modify its getTextHelper method.
Original getTextHelper source:
private boolean getTextHelper(StringBuffer sb, Node node,
                                             boolean abortOnNestedAnchors,
                                             int anchorDepth) {
    if ("script".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if ("style".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if (abortOnNestedAnchors && "a".equalsIgnoreCase(node.getNodeName())) {
      anchorDepth++;
      if (anchorDepth > 1)
        return true;
    }
    if (node.getNodeType() == Node.COMMENT_NODE) {
      return false;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      // cleanup and trim the value
      String text = node.getNodeValue();
      text = text.replaceAll("\\s+", " ");
      text = text.trim();
      if (text.length() > 0) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(text);
      }
    }
    boolean abort = false;
    NodeList children = node.getChildNodes();
    if (children != null) {
      int len = children.getLength();
      for (int i = 0; i < len; i++) {
        if (getTextHelper(sb, children.item(i),
                          abortOnNestedAnchors, anchorDepth)) {
          abort = true;
          break;
        }
      }
    }
    return abort;
}

Custom method (in effect it filters out the parsed text that sits inside <a href=" "> elements):
private boolean getTextHelper(StringBuffer sb, Node node,
                                             boolean abortOnNestedAnchors,
                                             int anchorDepth) {
    if ("script".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if ("style".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if (abortOnNestedAnchors && "a".equalsIgnoreCase(node.getNodeName())) {
      anchorDepth++;
      if (anchorDepth > 1)
        return true;
    }
    if (node.getNodeType() == Node.COMMENT_NODE) {
      return false;
    }

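    if (node.getNodeType() == Node.TEXT_NODE) { // node comes from the parsed source of the page
      // Node.TEXT_NODE: a text node, i.e. the text found inside tags such as <body>, <div>, <a href>, <td>
      // cleanup and trim the value
      String text = node.getNodeValue();         // the text content of the node, i.e. the page text without HTML tags

      /* collapse whitespace, then filter out special characters */
      text = text.replaceAll("\\s+", " ");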
      text = text.replace("【", "");
      text = text.replace("】", "");
      text = text.replace("[", "");
      text = text.replace("]", "");
      text = text.replace("|", "");
      text = text.replace("┊", "");
      text = text.replace("?", "");
      text = text.replace("?", "");
      text = text.replace("?", "");
      text = text.replace("|", "");
      text = text.replace("、", "");
      text = text.replace("-", "");
      text = text.replace("~", "");
      text = text.replace("!", "");
      text = text.replace("@", "");
      text = text.replace("#", "");
      text = text.replace("$", "");
      text = text.replace("^", "");
      text = text.replace("*", "");
      text = text.replace("(", "");
      text = text.replace(")", "");
      text = text.replace("%", "");
      text = text.replace(">", "");
      text = text.replace("?", "");
      text = text.replace("%", "");

      text = text.trim();

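      // String form of the parent node; the check below assumes that the DOM
      // implementation's toString() output contains "A:" for an <a> element
      String temp = node.getParentNode().toString();

      // keep the text only if its direct parent is not an <a href> element
      if (text.length() > 0 && temp.indexOf("A:") == -1) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(text);
      }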
    }
    boolean abort = false;
    NodeList children = node.getChildNodes();
    if (children != null) {
      int len = children.getLength();
      for (int i = 0; i < len; i++) {
        if (getTextHelper(sb, children.item(i),
                          abortOnNestedAnchors, anchorDepth)) {
          abort = true;
          break;
        }
      }
    }
    return abort;
}
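
The parent check above only looks at the direct parent and relies on the string form of the node, so text such as <a><b>...</b></a> would still be kept. A minimal sketch of a stricter variant follows, assuming we are free to walk up the DOM tree and compare element names. The class AnchorTextFilterSketch and the helpers isInsideAnchor, stripSymbols, collectText and extract are hypothetical names used only for illustration and are not part of Nutch. The sketch uses the JDK's XML parser on a small well-formed fragment instead of Nutch's HTML parser, and it folds the chain of replace calls into one regular-expression character class.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AnchorTextFilterSketch {

    // Hypothetical helper: true if any ancestor of the node is an <a> element.
    static boolean isInsideAnchor(Node node) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(p.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical helper: collapse whitespace, then drop decorative symbols in
    // one pass. \u3010 and \u3011 are the full-width brackets, \u250A the dotted
    // vertical bar and \u3001 the ideographic comma from the list above.
    static String stripSymbols(String text) {
        return text.replaceAll("\\s+", " ")
                   .replaceAll("[\u3010\u3011\\[\\]|\u250A\u3001\\-~!@#$^*()%>]", "")
                   .trim();
    }

    // Recursively collect text, skipping script/style subtrees and link text.
    static void collectText(Node node, StringBuilder sb) {
        String name = node.getNodeName();
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE && !isInsideAnchor(node)) {
            String text = stripSymbols(node.getNodeValue());
            if (!text.isEmpty()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), sb);
        }
    }

    // Parse a well-formed XHTML fragment with the JDK parser and extract its text.
    static String extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder sb = new StringBuilder();
            collectText(doc.getDocumentElement(), sb);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        String page = "<div><a href=\"/nba\"><b>Yao Ming</b></a>"
                    + "<p>[Match] report!</p></div>";
        System.out.println(extract(page)); // prints "Match report"
    }
}
```

Whether dropping all link text is desirable depends on the site: pages whose main content consists mostly of links would lose most of their indexed text.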

4 Highlighting
Highlighting is handled by the class org.apache.nutch.searcher.Summary; modifying its toHtml method is enough:
public String toHtml(boolean encode) {
    Fragment fragment = null;
    StringBuffer buf = new StringBuffer();
    for (int i=0; i<fragments.size(); i++) {
      fragment = (Fragment) fragments.get(i);
      if (fragment.isHighlight()) {
        buf.append("<span style=\"color:red\">")    // statement before the change: buf.append("<span class=\"highlight\">")
           .append(encode ? Entities.encode(fragment.getText())
                          : fragment.getText())
           .append("</span>");
      } else if (fragment.isEllipsis()) {
        buf.append("<span class=\"ellipsis\"> ... </span>");
      } else {
        buf.append(encode ? Entities.encode(fragment.getText())
                          : fragment.getText());
      }
    }
    return buf.toString();
}
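
As an alternative to hard-coding the color in Java, the markup can keep a CSS class and leave the color to the stylesheet (for example a rule such as .highlight { color: red; }). The following is only a sketch under that assumption: HighlightSketch, Piece and escape are hypothetical stand-ins for Summary, its Fragment type and Entities.encode, written so that the example runs without Nutch on the classpath.

```java
import java.util.Arrays;
import java.util.List;

class HighlightSketch {

    // Hypothetical stand-in for a summary fragment: text plus a highlight flag.
    static final class Piece {
        final String text;
        final boolean highlight;

        Piece(String text, boolean highlight) {
            this.text = text;
            this.highlight = highlight;
        }
    }

    // Minimal HTML escaping, standing in for Entities.encode.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Wrap highlighted pieces in a span carrying the given CSS class.
    static String toHtml(List<Piece> pieces, String cssClass) {
        StringBuilder buf = new StringBuilder();
        for (Piece p : pieces) {
            if (p.highlight) {
                buf.append("<span class=\"").append(cssClass).append("\">")
                   .append(escape(p.text))
                   .append("</span>");
            } else {
                buf.append(escape(p.text));
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        List<Piece> pieces = Arrays.asList(
                new Piece("Yao Ming", true),
                new Piece(" scored <30> points", false));
        // prints: <span class="highlight">Yao Ming</span> scored &lt;30&gt; points
        System.out.println(toHtml(pieces, "highlight"));
    }
}
```

With this approach the highlight color can be changed in the page's stylesheet without recompiling the class.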

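5 Pagination
By default, Nutch only offers a "next page" link when it displays search results. The main reason is that Nutch follows the “80/20” principle: it returns the most relevant records first, and almost no user cares about anything beyond the third page. I implemented pagination and removed "show all hits":
1 Remove show all hits (in search.jsp) by changing:
         int hitsPerSite = 0; // max hits per site
2 Pagination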
<table align="center">
     <tr>
       <td>
         <%
           if (start >= hitsPerPage) // more hits to show
              {
          %>
             <form name="pre" action="../search.jsp" method="get">
                  <input type="hidden" name="query" value="<%=htmlQueryString%>">
                  <input type="hidden" name="lang" value="<%=queryLang%>">
                  <input type="hidden" name="start" value="<%=start - hitsPerPage%>">
                  <input type="hidden" name="hitsPerPage" value="<%=hitsPerPage%>">
                  <input type="hidden" name="hitsPerSite" value="<%=hitsPerSite%>">
                  <input type="hidden" name="clustering" value="<%=clustering%>">
                  <input type="submit" value="Previous page">
             </form>
           <%} %>
       </td>
           <%
      int startnum=1;// first page number shown in the pager; the pager is set to show 10 pages, with the current page in 6th position
                    if((int)(start/hitsPerPage)>=5)
                    startnum=(int)(start/hitsPerPage)-4;
                    for(int i=hitsPerPage*(startnum-1),j=0;i<=hits.getTotal()&&j<=10;)
                    {
                     %>
                     <td>
                     <form name="next" action="../search.jsp" method="get">
                        <input type="hidden" name="query" value="<%=htmlQueryString%>">
                        <input type="hidden" name="lang" value="<%=queryLang%>">
                        <input type="hidden" name="start" value="<%=i%>">
                        <input type="hidden" name="hitsPerPage" value="<%=hitsPerPage%>">
                        <input type="hidden" name="hitsPerSite" value="<%=hitsPerSite%>">
                        <input type="hidden" name="clustering" value="<%=clustering%>">
                        <input type="submit" value="<%=i/hitsPerPage+1 %>">
                    </form>
                    </td>
                    <%
                    i=i+5;
                    j++;
                    }
                     %>
                <td>
                    <%
         if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show
                                || (!hits.totalIsExact() && (hits.getLength() > start
                                + hitsPerPage))) {
                    %>
                   
                    <form name="next" action="../search.jsp" method="get">
                        <input type="hidden" name="query" value="<%=htmlQueryString%>">
                        <input type="hidden" name="lang" value="<%=queryLang%>">
                        <input type="hidden" name="start" value="<%=end%>">
                        <input type="hidden" name="hitsPerPage" value="<%=hitsPerPage%>">
                        <input type="hidden" name="hitsPerSite" value="<%=hitsPerSite%>">
                        <input type="hidden" name="clustering" value="<%=clustering%>">
                        <input type="submit" value="<i18n:message key="next"/>"><%-- next page --%>
                    </form>
                    <%} %>
                    </td>
                  </tr>
                    </table>
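Note on the loop step used in the page-number loop above:
<%
    i=i+5;
    j++;
}
%>
The i here should be changed to i=i+10 so that the page links match a display of 10 records per page; more generally, the step must equal hitsPerPage.
To display 5 records per page instead, the hitsPerPage definition above has to be changed as well.
Otherwise the page links point to the wrong offsets and the results are displayed incorrectly.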
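
The page-window arithmetic can be checked outside the JSP. Below is a minimal sketch, assuming the same conventions as the loop above: start is the offset of the first hit on the current page, and the current page should appear in 6th position once enough earlier pages exist. PagerSketch, pageStarts and maxLinks are hypothetical names. Unlike the JSP loop, the sketch advances by hitsPerPage and uses strict bounds, so it never emits a link whose offset equals the total number of hits and never more than maxLinks links.

```java
import java.util.ArrayList;
import java.util.List;

class PagerSketch {

    // Returns the "start" offsets of the page links to display.
    static List<Integer> pageStarts(int start, int hitsPerPage, long total, int maxLinks) {
        int current = start / hitsPerPage;    // zero-based index of the current page
        int first = Math.max(0, current - 5); // keep the current page in 6th position
        List<Integer> starts = new ArrayList<>();
        long offset = (long) first * hitsPerPage;
        while (offset < total && starts.size() < maxLinks) {
            starts.add((int) offset);
            offset += hitsPerPage;            // step by one full page of results
        }
        return starts;
    }

    public static void main(String[] args) {
        // 35 hits, 10 per page, currently on the first page
        System.out.println(pageStarts(0, 10, 35, 10)); // prints [0, 10, 20, 30]
    }
}
```

Each returned offset corresponds to one submit button in the JSP, labelled offset / hitsPerPage + 1.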

For Part 1 of this article, see: http://hi.baidu.com/zhumulangma/blog/item/7b39adc294d13c130ff477e2.html