Indexing and Searching on a Hadoop Distributed File System

FROM:http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241?pgno=3

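In today's information-saturated world, the enormous growth of geographically distributed data calls for a system that supports fast parsing and retrieval of meaningful results. A searchable index over distributed data goes a long way toward speeding up that process. In this article, I demonstrate how to use Lucene and Java for basic data indexing and searching, how to use a RAM directory for indexing and searching, how to create indexes for data residing in HDFS, and how to search those indexes. The development environment consists of Java 1.6, Eclipse 3.4.2, Lucene 2.4.0, and Hadoop 0.19.1, running on Microsoft Windows XP SP3.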

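To tackle this task, I turned to Hadoop. The Apache Hadoop project develops open source software for reliable, scalable, distributed computing, and the Hadoop Distributed File System (HDFS) is designed for storing and sharing files across wide area networks. HDFS is built to run on commodity hardware and provides fault tolerance, resource management, and, most importantly, high-throughput access to application data.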

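Creating an Index on a Local File System

The first step is to create an index for data stored on the local file system. Begin by creating a project in Eclipse, create a class, and then add the required JAR files to the project. Take this example of data found in a web server log file for a web application: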

2010-04-21 02:24:01 GET /blank 200 120

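This data maps to fields as follows:

2010-04-21 - date field
02:24:01 - time field
GET - method field (GET or POST), which we will denote as "cs-method"
/blank - requested URL field, which we will denote as "cs-uri"
200 - status code of the request, which we will denote as "sc-status"
120 - time-taken field (the time required to complete the request), which we will denote as "time-taken"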
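
The field mapping above can be sketched in plain Java before any indexing library is involved. This is a minimal illustration, assuming every log line has exactly six space-separated values; the class `LogLineParser` and its `parse` method are hypothetical names introduced here, not part of the article's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the article): maps one space-delimited
// log line onto the six field names used in this article.
class LogLineParser {
    static final String[] FIELD_NAMES = {
        "date", "time", "cs-method", "cs-uri", "sc-status", "time-taken"
    };

    // Returns the fields in log order; rejects lines with a different field count.
    static Map<String, String> parse(String line) {
        String[] parts = line.trim().split(" ");
        if (parts.length != FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELD_NAMES.length + " fields but found " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            fields.put(FIELD_NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("2010-04-21 02:24:01 GET /blank 200 120");
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Rejecting malformed lines early is a design choice of this sketch; it avoids an out-of-bounds array access later when the fields are added to a document.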

The data in our sample file, which is located in the folder "E:\DataFile" and named "Test.txt", is as follows:

2010-04-21 02:24:01 GET /blank 200 120
2010-04-21 02:24:01 GET /US/registrationFrame 200 605
2010-04-21 02:24:02 GET /US/kids/boys 200 785
2010-04-21 02:24:02 POST /blank 304 56
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
2010-04-21 02:24:05 GET /US/search 200 658
2010-04-21 02:24:05 POST /US/shop 200 768
2010-04-21 02:24:05 GET /blank 200 347

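We want to index the data in this "Test.txt" file and save the index to the local file system. The following Java code does this. (See the comments for details on what each part of the code does.)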

 import java.io.BufferedReader;
 import java.io.FileReader;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;

 // Creating IndexWriter object and specifying the path where the index
 // files are to be stored. The last argument (true) creates a new index.
 IndexWriter indexWriter = new IndexWriter("E://DataFile/IndexFiles", new StandardAnalyzer(), true);
 // Creating BufferedReader object and specifying the path of the file
 // whose data is to be indexed.
 BufferedReader reader = new BufferedReader(new FileReader("E://DataFile/Test.txt"));
 String row = null;
 // Reading each line present in the file.
 while ((row = reader.readLine()) != null)
 {
     // Splitting the row into an array of fields; the file is space-delimited.
     String Arow[] = row.split(" ");
     // For each row, creating a document and adding data to the document with the associated fields.
     Document document = new Document();
     document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));
     // Adding the document to the index.
     indexWriter.addDocument(document);
 }
 // Merging index segments, then releasing the writer and the reader.
 indexWriter.optimize();
 indexWriter.close();
 reader.close();

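Once executed, this Java code creates the index files and stores them at the location "E://DataFile/IndexFiles".

Now we can search for data in the index files we just created. Searching is done on "field" data. Lucene supports a variety of search semantics, and you can search on a particular field or on a combination of fields. The following Java code searches the index: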

 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.search.Hits;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.Searcher;

 // Creating Searcher object and specifying the path where the index files are stored.
 Searcher searcher = new IndexSearcher("E://DataFile/IndexFiles");
 Analyzer analyzer = new StandardAnalyzer();
 // Printing the total number of documents or entries present in the index.
 System.out.println("Total Documents = " + searcher.maxDoc());
 // Creating the QueryParser object and specifying the field name on
 // which the search has to be done.
 QueryParser parser = new QueryParser("cs-uri", analyzer);
 // Creating the Query object and specifying the text to search for.
 Query query = parser.parse("/blank");
 // Performing the search on the index; the matches are returned as Hits.
 Hits hits = searcher.search(query);
 // Printing the number of documents or entries that match the search query.
 System.out.println("Number of matching documents = " + hits.length());
 // Printing documents (or rows of the file) that matched the search criteria.
 for (int i = 0; i < hits.length(); i++)
 {
     Document doc = hits.doc(i);
     System.out.println(doc.get("date") + " " + doc.get("time") + " " +
         doc.get("cs-method") + " " + doc.get("cs-uri") + " " +
         doc.get("sc-status") + " " + doc.get("time-taken"));
 }
 searcher.close();

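In this example, the search is done on the field "cs-uri", and the text searched for within that field is "/blank". So when the search code runs, all documents (or rows) whose "cs-uri" field contains "/blank" are shown in the output. The output is as follows: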

 Total Documents = 11
 Number of matching documents = 7
 2010-04-21 02:24:01 GET /blank 200 120
 2010-04-21 02:24:02 POST /blank 304 56
 2010-04-21 02:24:04 GET /blank 304 233
 2010-04-21 02:24:04 GET /blank 500 567
 2010-04-21 02:24:04 GET /blank 200 897
 2010-04-21 02:24:04 POST /blank 200 567
 2010-04-21 02:24:05 GET /blank 200 347
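
To make the idea of a field search more concrete, here is a small JDK-only sketch of an inverted index over the "cs-uri" field. It is a deliberately simplified stand-in, not Lucene's implementation: the class `TinyFieldIndex`, its tokenizer (lowercase, split on non-alphanumeric characters), and its all-terms-must-match rule are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hand-rolled inverted index over one field.
// All names here are hypothetical; this is not how Lucene works internally.
class TinyFieldIndex {
    static final String[] SAMPLE_ROWS = {
        "2010-04-21 02:24:01 GET /blank 200 120",
        "2010-04-21 02:24:01 GET /US/registrationFrame 200 605",
        "2010-04-21 02:24:02 GET /US/kids/boys 200 785",
        "2010-04-21 02:24:02 POST /blank 304 56",
        "2010-04-21 02:24:04 GET /blank 304 233",
        "2010-04-21 02:24:04 GET /blank 500 567",
        "2010-04-21 02:24:04 GET /blank 200 897",
        "2010-04-21 02:24:04 POST /blank 200 567",
        "2010-04-21 02:24:05 GET /US/search 200 658",
        "2010-04-21 02:24:05 POST /US/shop 200 768",
        "2010-04-21 02:24:05 GET /blank 200 347"
    };

    private final List<String> rows = new ArrayList<String>();
    // term -> ids of the rows whose cs-uri field contains that term
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    // Simplified analyzer: lowercase, then split on non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<String>();
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!term.isEmpty()) terms.add(term);
        }
        return terms;
    }

    // Stores the row and indexes the terms of its fourth column (cs-uri).
    void add(String row) {
        int id = rows.size();
        rows.add(row);
        for (String term : tokenize(row.split(" ")[3])) {
            List<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                postings.put(term, ids);
            }
            if (!ids.contains(id)) ids.add(id);
        }
    }

    int size() {
        return rows.size();
    }

    // Returns the rows whose cs-uri field contains every term of the query.
    List<String> search(String queryText) {
        List<String> result = new ArrayList<String>();
        List<Integer> candidates = null;
        for (String term : tokenize(queryText)) {
            List<Integer> ids = postings.get(term);
            if (ids == null) return result;
            if (candidates == null) candidates = new ArrayList<Integer>(ids);
            else candidates.retainAll(ids);
        }
        if (candidates != null) {
            for (int id : candidates) result.add(rows.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        TinyFieldIndex index = new TinyFieldIndex();
        for (String row : SAMPLE_ROWS) index.add(row);
        List<String> matches = index.search("/blank");
        System.out.println("Total rows = " + index.size());
        System.out.println("Matching rows = " + matches.size());
        for (String match : matches) System.out.println(match);
    }
}
```

Run against the eleven sample rows, the sketch finds 7 rows for "/blank", the same count the Lucene search reports. The point of the lookup table is that a search visits only the rows listed under a term instead of scanning every row.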

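Memory-Based Indexing on HDFS

Now consider the case where the data is located on a distributed file system such as Hadoop DFS. The code above cannot create an index directly on distributed data, so we would have to complete a few extra steps first: copy the data from HDFS to a local file system, create an index of the data on the local file system, and finally store the index files on HDFS. The same steps would be needed for searching. This approach is time consuming and suboptimal. Instead, let's index and search our data using the memory of the HDFS node where the data resides.

Assume that the data file "Test.txt" used earlier now resides on HDFS, inside a folder in the working directory, as "/DataFile/Test.txt". Create another folder called "/IndexFiles" inside the HDFS working directory, where the generated index files will be stored. The following Java code creates index files in memory for the file stored on HDFS: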

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.store.RAMDirectory;

 // Path where the index files will be stored.
 String Index_DIR = "/IndexFiles/";
 // Path where the data file is stored (HDFS paths are case-sensitive).
 String File_DIR = "/DataFile/Test.txt";
 // Creating FileSystem object, to be able to work with HDFS.
 Configuration config = new Configuration();
 config.set("fs.default.name", "hdfs://127.0.0.1:9000/");
 FileSystem dfs = FileSystem.get(config);
 // Creating a RAMDirectory (memory) object, to be able to create the index in memory.
 RAMDirectory rdir = new RAMDirectory();
 // Creating IndexWriter object for the RAM directory.
 IndexWriter indexWriter = new IndexWriter(rdir, new StandardAnalyzer(), true);
 // Creating FSDataInputStream object, for reading the data from the "Test.txt" file residing on HDFS.
 FSDataInputStream filereader = dfs.open(new Path(dfs.getWorkingDirectory() + File_DIR));
 // Wrapping the stream in a BufferedReader so it can be read line by line.
 BufferedReader reader = new BufferedReader(new InputStreamReader(filereader));
 String row = null;
 // Reading each line present in the file.
 while ((row = reader.readLine()) != null)
 {
     // Splitting the row into an array of fields; the file is space-delimited.
     String Arow[] = row.split(" ");
     // For each row, creating a document and adding data to the document
     // with the associated fields.
     Document document = new Document();
     document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
     document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));
     // Adding the document to the index.
     indexWriter.addDocument(document);
 }
 indexWriter.optimize();
 indexWriter.close();
 // Closing the reader also closes the underlying HDFS stream.
 reader.close();

So, for the "Test.txt" file residing on HDFS, we now have index files created in memory. To store the index files in the HDFS folder:

 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.Path;
 import org.apache.lucene.store.IndexInput;

 // Getting the names of the files present in memory into an array.
 String fileList[] = rdir.list();
 // Reading index files from memory and storing them to HDFS.
 for (int i = 0; i < fileList.length; i++)
 {
     IndexInput indxfile = rdir.openInput(fileList[i].trim());
     long len = indxfile.length();
     int len1 = (int) len;
     // Reading data from the file into a byte array.
     byte[] bytarr = new byte[len1];
     indxfile.readBytes(bytarr, 0, len1);
     indxfile.close();
     // Creating a file in the HDFS directory with the same name as the
     // index file; the second argument (true) overwrites an existing file.
     Path src = new Path(dfs.getWorkingDirectory() + Index_DIR + fileList[i].trim());
     FSDataOutputStream fs = dfs.create(src, true);
     // Writing data from the byte array to the file in HDFS.
     fs.write(bytarr);
     fs.close();
 }

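Now we have the necessary index files for the "Test.txt" data file created and stored in an HDFS directory.

Memory-Based Searching on HDFS

We can now search the indexes stored in HDFS. First, we have to bring the index files from HDFS into memory for searching. The following code is used for this process: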
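
The copy pattern used here (walk the in-memory file names, read each file's bytes, write them out under the same name) can be tried without a Hadoop cluster. The sketch below is only an analogy, assuming Java 7 or later: a `Map<String, byte[]>` stands in for the RAM directory, a local temporary folder stands in for HDFS, and the file names and the class `InMemoryFilesToDisk` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Analogy only: a Map stands in for the RAM directory and a local
// temporary folder stands in for the HDFS "/IndexFiles" folder.
class InMemoryFilesToDisk {
    // Writes every named byte array into targetDir under the same name.
    static void persist(Map<String, byte[]> memoryFiles, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        for (Map.Entry<String, byte[]> entry : memoryFiles.entrySet()) {
            Files.write(targetDir.resolve(entry.getKey()), entry.getValue());
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> memoryFiles = new LinkedHashMap<String, byte[]>();
        // Placeholder names and contents, not real index files.
        memoryFiles.put("file-a.bin", new byte[] {1, 2, 3});
        memoryFiles.put("file-b.bin", new byte[] {4, 5});

        Path dir = Files.createTempDirectory("IndexFiles");
        persist(memoryFiles, dir);
        for (String name : memoryFiles.keySet()) {
            System.out.println(name + " -> " + Files.size(dir.resolve(name)) + " bytes");
        }
    }
}
```

Keeping the file names unchanged matters because the files of an index refer to one another by name, so a later load has to restore them under the same names.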

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.lucene.store.IndexOutput;
 import org.apache.lucene.store.RAMDirectory;
 // FileSystemDirectory is not part of Lucene core. It is assumed here to come
 // from Hadoop's contrib index module; adjust the import to your setup.

 // Path where the index files are stored.
 String Index_DIR = "/IndexFiles/";
 // Creating FileSystem object, to be able to work with HDFS.
 Configuration config = new Configuration();
 config.set("fs.default.name", "hdfs://127.0.0.1:9000/");
 FileSystem dfs = FileSystem.get(config);
 // Creating a RAMDirectory (memory) object, to hold the index files in memory.
 RAMDirectory rdir = new RAMDirectory();
 // Getting the list of index files present in the directory into an array.
 Path pth = new Path(dfs.getWorkingDirectory() + Index_DIR);
 FileSystemDirectory fsdir = new FileSystemDirectory(dfs, pth, false, config);
 String filelst[] = fsdir.list();
 for (int i = 0; i < filelst.length; i++)
 {
     // Opening the index file on the HDFS directory.
     Path src = new Path(dfs.getWorkingDirectory() + Index_DIR + filelst[i]);
     FSDataInputStream filereader = dfs.open(src);
     // Sizing the byte array from the file length, then reading the whole
     // file; a single read() call may return fewer bytes than requested.
     int size = (int) dfs.getFileStatus(src).getLen();
     byte[] bytarr = new byte[size];
     filereader.readFully(bytarr);
     filereader.close();
     // Creating a file in the RAM directory with the same name as the
     // index file present in the HDFS directory.
     IndexOutput indxout = rdir.createOutput(filelst[i]);
     // Writing data from the byte array to the file in the RAM directory.
     indxout.writeBytes(bytarr, bytarr.length);
     indxout.flush();
     indxout.close();
 }
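
One detail is easy to miss when loading a file into a byte array: a single `read(buffer, 0, size)` call is allowed to return fewer bytes than requested, which would leave part of an index file empty. The JDK-only sketch below shows the effect and one way around it with `DataInputStream.readFully`. The `TrickleStream` class is a hypothetical stand-in for a stream that delivers data in small chunks; it is not an HDFS class.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadFullyDemo {
    // Hypothetical stream that hands out at most 3 bytes per bulk read,
    // imitating a source that delivers data in small chunks.
    static class TrickleStream extends InputStream {
        private final InputStream in;

        TrickleStream(byte[] data) {
            this.in = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return in.read(b, off, Math.min(len, 3));
        }
    }

    // One read call: returns how many bytes actually arrived.
    static int singleRead(byte[] data) throws IOException {
        byte[] buffer = new byte[data.length];
        return new TrickleStream(data).read(buffer, 0, buffer.length);
    }

    // readFully keeps reading until the buffer is full (or throws EOFException).
    static byte[] readAll(byte[] data) throws IOException {
        byte[] buffer = new byte[data.length];
        DataInputStream in = new DataInputStream(new TrickleStream(data));
        try {
            in.readFully(buffer);
        } finally {
            in.close();
        }
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println("single read returned " + singleRead(data) + " of " + data.length + " bytes");
        System.out.println("readFully filled " + readAll(data).length + " bytes");
    }
}
```

Because Hadoop's `FSDataInputStream` extends `DataInputStream`, the same `readFully` call is available on streams opened from HDFS.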

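Now we have all the required index files in the RAM directory (in memory), so we can search the index files directly. The search code is similar to the code used for searching the local file system. The only change is that the Searcher object is now created from the RAM directory object (rdir) instead of from the local file system directory path.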

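 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.search.Hits;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.Searcher;

 // Creating the Searcher object on the RAM directory instead of a local path.
 Searcher searcher = new IndexSearcher(rdir);
 Analyzer analyzer = new StandardAnalyzer();
 System.out.println("Total Documents = " + searcher.maxDoc());
 // Searching the "time" field; the colons are escaped with backslashes.
 QueryParser parser = new QueryParser("time", analyzer);
 Query query = parser.parse("02\\:24\\:04");
 Hits hits = searcher.search(query);
 System.out.println("Number of matching documents = " + hits.length());
 for (int i = 0; i < hits.length(); i++)
 {
     Document doc = hits.doc(i);
     System.out.println(doc.get("date") + " " + doc.get("time") + " " +
         doc.get("cs-method") + " " + doc.get("cs-uri") + " " +
         doc.get("sc-status") + " " + doc.get("time-taken"));
 }
 searcher.close();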

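Here the search is done on the field "time", and the text searched for within the "time" field is "02\\:24\\:04". The backslashes escape the colons, which the query parser would otherwise treat as the separator between a field name and its value. So when the code runs, all documents (or rows) whose "time" field contains "02:24:04" are shown in the output: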

 Total Documents = 11
 Number of matching documents = 4
 2010-04-21 02:24:04 GET /blank 304 233
 2010-04-21 02:24:04 GET /blank 500 567
 2010-04-21 02:24:04 GET /blank 200 897
 2010-04-21 02:24:04 POST /blank 200 567
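
The doubled backslashes in the query string are a common source of confusion, because two layers of escaping are involved: Java string literals and the query syntax. The short sketch below separates the two. The helper `escapeColons` is a hypothetical name written for this illustration and handles only colons; it is not a Lucene API.

```java
class TimeQueryEscaping {
    // Hypothetical helper: puts a backslash in front of every colon so the
    // colon is read as plain text rather than as a field separator.
    static String escapeColons(String text) {
        return text.replace(":", "\\:");
    }

    public static void main(String[] args) {
        // In Java source, "\\" is one backslash character.
        String literal = "02\\:24\\:04";
        System.out.println(literal);            // prints 02\:24\:04
        System.out.println(literal.length());   // prints 10
        // The helper produces the same ten characters from the raw time value.
        System.out.println(escapeColons("02:24:04").equals(literal)); // prints true
    }
}
```

So the query parser receives a single backslash before each colon; the second backslash exists only in the Java source.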

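Conclusion

Distributed file systems like HDFS are a powerful tool for storing and accessing the vast amounts of data available to us today. With memory-based indexing and searching, finding the data you really want among the mountains of data you don't care about gets a little bit easier.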
