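In today's information-saturated world, the enormous growth of geographically distributed data calls for systems that can parse it and retrieve meaningful results quickly. A searchable index of the distributed data goes a long way toward speeding up that process. In this article, I demonstrate how to use Lucene and Java for basic data indexing and searching, how to use a RAM directory for indexing and searching, how to create indexes for data residing on HDFS, and how to search those indexes. The development environment consists of Java 1.6, Eclipse 3.4.2, Lucene 2.4.0, and Hadoop 0.19.1 running on Microsoft Windows XP SP3.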
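To tackle this task, I turned to Hadoop. The Apache Hadoop project develops open source software for reliable, scalable, distributed computing, and the Hadoop Distributed File System (HDFS) is designed for storing and sharing files across wide area networks. HDFS is built to run on commodity hardware and provides fault tolerance, resource management, and, most importantly, high-throughput access to application data.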
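Each row of the sample data file holds six space-separated fields. Consider the row:
2010-04-21 02:24:01 GET /blank 200 120
- 2010-04-21 - the date field
- 02:24:01 - the time field
- GET - the method field (GET or POST), which we will denote as "cs-method"
- /blank - the requested URL field, which we will denote as "cs-uri"
- 200 - the status code of the request, which we will denote as "sc-status"
- 120 - the time-taken field (the time required to complete the request)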
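The field layout above can be sketched in plain Java before any indexing library is involved. This is a minimal illustration only: the class name `LogLineParser` and the method `parse` are hypothetical, and the field names "date", "time", and "time-taken" are taken from the indexing code later in the article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Lucene or Hadoop): maps one log row to named fields.
class LogLineParser {
    // Field names in the order in which the values appear in a row.
    static final String[] FIELD_NAMES = {
        "date", "time", "cs-method", "cs-uri", "sc-status", "time-taken"
    };

    // Splits a space-delimited row and pairs each value with its field name.
    static Map<String, String> parse(String row) {
        String[] values = row.split(" ");
        if (values.length != FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELD_NAMES.length + " fields but found " + values.length + ": " + row);
        }
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            fields.put(FIELD_NAMES[i], values[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("2010-04-21 02:24:01 GET /blank 200 120");
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Rejecting rows that do not have exactly six fields is a design choice of this sketch; it surfaces malformed rows early instead of failing later with an array index error.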
The data in our sample file, which is located in "E:\DataFile" and named "test.txt", is as follows:
2010-04-21 02:24:01 GET /blank 200 120
2010-04-21 02:24:01 GET /US/registrationFrame 200 605
2010-04-21 02:24:02 GET /US/kids/boys 200 785
2010-04-21 02:24:02 POST /blank 304 56
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
2010-04-21 02:24:05 GET /US/search 200 658
2010-04-21 02:24:05 POST /US/shop 200 768
2010-04-21 02:24:05 GET /blank 200 347
// Create an IndexWriter and specify the path where the index files are to be stored.
IndexWriter indexWriter = new IndexWriter("E://DataFile/IndexFiles", new StandardAnalyzer(), true);
// Create a BufferedReader for the file whose data is to be indexed.
BufferedReader reader = new BufferedReader(new FileReader("E://DataFile/Test.txt"));
String row = null;
// Read each line present in the file.
while ((row = reader.readLine()) != null) {
    // Split the row into an array of fields; the file is space-delimited.
    String Arow[] = row.split(" ");
    // For each row, create a document and add the data to it under the associated field names.
    org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();
    document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));
    // Add the document to the index.
    indexWriter.addDocument(document);
}
// Optimize the index, then release the writer and the reader.
indexWriter.optimize();
indexWriter.close();
reader.close();
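Once executed, this Java code creates the index files and stores them at the location "E://DataFile/IndexFiles". The index can then be searched; the following code looks for "/blank" in the "cs-uri" field: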
// Create a Searcher and specify the path where the index files are stored.
Searcher searcher = new IndexSearcher("E://DataFile/IndexFiles");
Analyzer analyzer = new StandardAnalyzer();
// Print the total number of documents (entries) present in the index.
System.out.println("Total Documents = " + searcher.maxDoc());
// Create the QueryParser and specify the field on which the search is to be done.
QueryParser parser = new QueryParser("cs-uri", analyzer);
// Create the Query and specify the text to search for.
Query query = parser.parse("/blank");
// Perform the search on the index.
Hits hits = searcher.search(query);
// Print the number of documents (entries) that match the search query.
System.out.println("Number of matching documents = " + hits.length());
// Print the documents (rows of the file) that matched the search criteria.
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("date") + " " + doc.get("time") + " " +
        doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
}
Total Documents = 11
Number of matching documents = 7
2010-04-21 02:24:01 GET /blank 200 120
2010-04-21 02:24:02 POST /blank 304 56
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
2010-04-21 02:24:05 GET /blank 200 347
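To see why the search for "/blank" matches 7 of the 11 rows, it can help to picture the index as a map from each token to the rows that contain it. The sketch below is a toy illustration of that idea in plain Java and is not how Lucene is implemented. The class `ToyUriIndex` and its methods are hypothetical, and the tokenization rule (lowercase, split on anything that is not a letter or digit) is an assumption that only roughly approximates what an analyzer does to these URIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index over the cs-uri field (hypothetical, for illustration only).
class ToyUriIndex {
    static final String[] SAMPLE_ROWS = {
        "2010-04-21 02:24:01 GET /blank 200 120",
        "2010-04-21 02:24:01 GET /US/registrationFrame 200 605",
        "2010-04-21 02:24:02 GET /US/kids/boys 200 785",
        "2010-04-21 02:24:02 POST /blank 304 56",
        "2010-04-21 02:24:04 GET /blank 304 233",
        "2010-04-21 02:24:04 GET /blank 500 567",
        "2010-04-21 02:24:04 GET /blank 200 897",
        "2010-04-21 02:24:04 POST /blank 200 567",
        "2010-04-21 02:24:05 GET /US/search 200 658",
        "2010-04-21 02:24:05 POST /US/shop 200 768",
        "2010-04-21 02:24:05 GET /blank 200 347"
    };

    // token -> numbers of the rows whose cs-uri field contains that token
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int rowCount = 0;

    // Assumed analysis step: lowercase the text and split on non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Indexes the cs-uri field (the fourth space-separated value) of one row.
    void addRow(String row) {
        String uri = row.split(" ")[3];
        for (String token : tokenize(uri)) {
            List<Integer> rows = postings.get(token);
            if (rows == null) {
                rows = new ArrayList<>();
                postings.put(token, rows);
            }
            // Record each row at most once per token.
            if (rows.isEmpty() || rows.get(rows.size() - 1) != rowCount) {
                rows.add(rowCount);
            }
        }
        rowCount++;
    }

    // Returns the numbers of the rows that contain every token of the query text.
    List<Integer> search(String queryText) {
        List<Integer> result = null;
        for (String token : tokenize(queryText)) {
            List<Integer> rows = postings.get(token);
            if (rows == null) {
                return Collections.emptyList();
            }
            if (result == null) {
                result = new ArrayList<>(rows);
            } else {
                result.retainAll(rows);
            }
        }
        return result == null ? Collections.<Integer>emptyList() : result;
    }

    int size() {
        return rowCount;
    }

    public static void main(String[] args) {
        ToyUriIndex index = new ToyUriIndex();
        for (String row : SAMPLE_ROWS) {
            index.addRow(row);
        }
        System.out.println("Total rows = " + index.size());
        System.out.println("Rows matching /blank = " + index.search("/blank").size());
    }
}
```

Run against the 11 sample rows, the sketch reports 11 rows in total and 7 rows matching "/blank", the same counts as the Lucene output above. The lookup touches only the entry for the token "blank" rather than scanning every row, which is the reason an index speeds up retrieval.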
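Now consider the case where the data is located on a distributed file system such as Hadoop DFS. The code above cannot create an index directly on distributed data, so we would have to go through a few preliminary steps: copy the data from HDFS to the local file system, create an index of the data on the local file system, and finally store the index files back on HDFS. The same steps would be required for searching. This approach is time-consuming and suboptimal. Instead, let's index and search our data using the memory of the HDFS node where the data resides.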
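Assume that the data file "Test.txt" used earlier now resides on HDFS, inside the working directory, at "/DataFile/Test.txt". Create another folder called "/IndexFiles" inside the HDFS working directory, where the generated index files will be stored. The following Java code creates the index in memory for the file stored on HDFS and then writes the index files to HDFS: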
// Path where the index files will be stored.
String Index_DIR = "/IndexFiles/";
// Path where the data file is stored.
String File_DIR = "/DataFile/test.txt";
// Create a FileSystem object to be able to work with HDFS.
Configuration config = new Configuration();
FileSystem dfs = FileSystem.get(config);
// Create a RAMDirectory (memory) object to be able to create the index in memory.
RAMDirectory rdir = new RAMDirectory();
// Create an IndexWriter for the RAM directory.
IndexWriter indexWriter = new IndexWriter(rdir, new StandardAnalyzer(), true);
// Open the data file residing on HDFS and wrap the stream so it can be read line by line.
FSDataInputStream filereader = dfs.open(new Path(dfs.getWorkingDirectory() + File_DIR));
BufferedReader reader = new BufferedReader(new InputStreamReader(filereader));
String row = null;
// Read each line present in the file.
while ((row = reader.readLine()) != null) {
    // Split the row into an array of fields; the file is space-delimited.
    String Arow[] = row.split(" ");
    // For each row, create a document and add the data to it under the associated field names.
    org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();
    document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));
    // Add the document to the index.
    indexWriter.addDocument(document);
}
indexWriter.optimize();
indexWriter.close();
reader.close();
// Get the names of the index files present in memory into an array.
String fileList[] = rdir.list();
// Read the index files from memory and store them on HDFS.
for (int i = 0; i < fileList.length; i++) {
    IndexInput indxfile = rdir.openInput(fileList[i].trim());
    long len = indxfile.length();
    int len1 = (int) len;
    // Read the data from the file into a byte array.
    byte[] bytarr = new byte[len1];
    indxfile.readBytes(bytarr, 0, len1);
    indxfile.close();
    // Create a file in the HDFS directory with the same name as the index file.
    Path src = new Path(dfs.getWorkingDirectory() + Index_DIR + fileList[i].trim());
    dfs.createNewFile(src);
    // Write the data from the byte array to the file on HDFS.
    FSDataOutputStream fs = dfs.create(src, true);
    fs.write(bytarr);
    fs.close();
}
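The copy loop above loads each index file into a single byte array, which assumes that the file length fits in an int. A common alternative is to copy in fixed-size chunks. The following is a minimal sketch of that pattern using only java.io streams as stand-ins for the Lucene and HDFS streams, which is an assumption made so the example is self-contained. The class `ChunkedCopySketch` and the method `copy` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: copies a stream in fixed-size chunks instead of one large array.
class ChunkedCopySketch {
    // Copies everything from in to out and returns the number of bytes copied.
    // bufferSize must be greater than zero.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        // read() returns -1 at end of stream; it may return fewer bytes than requested.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink, 4096);
        System.out.println("Copied " + copied + " bytes");
    }
}
```

Lucene's `IndexInput` is not a `java.io.InputStream`, so applying this idea to the loop above would mean calling `readBytes` repeatedly with a bounded buffer rather than reusing this method as-is.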
The following Java code loads the index files stored on HDFS into memory and searches them (Index_DIR is the path defined in the indexing code above):
// Create a FileSystem object to be able to work with HDFS.
Configuration config = new Configuration();
FileSystem dfs = FileSystem.get(config);
// Create a RAMDirectory (memory) object to hold the index in memory.
RAMDirectory rdir = new RAMDirectory();
// Get the list of index files present in the HDFS directory into an array.
Path pth = new Path(dfs.getWorkingDirectory() + Index_DIR);
FileSystemDirectory fsdir = new FileSystemDirectory(dfs, pth, false, config);
String filelst[] = fsdir.list();
FSDataInputStream filereader = null;
for (int i = 0; i < filelst.length; i++) {
    // Open the index file on HDFS and read its data into a byte array.
    filereader = dfs.open(new Path(dfs.getWorkingDirectory() + Index_DIR + filelst[i]));
    int size = filereader.available();
    byte[] bytarr = new byte[size];
    filereader.readFully(bytarr, 0, size);
    filereader.close();
    // Create a file in the RAM directory with the same name as the index file on HDFS.
    IndexOutput indxout = rdir.createOutput(filelst[i]);
    // Write the data from the byte array to the file in the RAM directory.
    indxout.writeBytes(bytarr, bytarr.length);
    indxout.close();
}
Searcher searcher = new IndexSearcher(rdir);
Analyzer analyzer = new StandardAnalyzer();
System.out.println("Total Documents = " + searcher.maxDoc());
QueryParser parser = new QueryParser("time", analyzer);
// The colons are escaped because ':' separates field name and term in the query syntax.
Query query = parser.parse("02\\:24\\:04");
Hits hits = searcher.search(query);
System.out.println("Number of matching documents = " + hits.length());
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("date") + " " + doc.get("time") + " " +
        doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
}
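In the output below, the search is performed on the "time" field, and the search text is "02\\:24\\:04", that is, 02:24:04 with the colons escaped. So when the code runs, all the documents (or rows) whose "time" field contains "02:24:04" are shown in the output: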
Total Documents = 11
Number of matching documents = 4
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
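The query text "02\\:24\\:04" in the search code involves two layers of escaping: in a Java string literal, "\\" stands for one backslash, so the parser receives 02\:24\:04, and the backslash tells the query parser to treat the colon as literal text rather than as the separator between a field name and a term. The sketch below shows the string-level effect only. The class `QueryEscapeSketch` and the method `escapeColons` are hypothetical; Lucene's `QueryParser` also provides a static `escape` method for this purpose, and its Javadoc for the version in use is the place to confirm the exact behavior.

```java
// Hypothetical helper: escapes colons so a query parser reads them as literal characters.
class QueryEscapeSketch {
    // Puts a backslash in front of every colon in the text.
    static String escapeColons(String text) {
        return text.replace(":", "\\:");
    }

    public static void main(String[] args) {
        String escaped = escapeColons("02:24:04");
        // The escaped text as the query parser would receive it.
        System.out.println(escaped);
        // The same text written as a Java string literal needs doubled backslashes.
        System.out.println(escaped.equals("02\\:24\\:04"));
    }
}
```

Running the sketch prints 02\:24\:04 followed by true, which shows that the literal used in the search code and the escaped time value are the same string.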
