Using Hadoop to find files that contain a particular string

Time: 2022-04-12 14:57:11

I have around 1000 files, each about 1 GB in size. I need to search all 1000 files for a string and find out which files contain that particular string. I am working with the Hadoop File System, and all 1000 files are stored there.

All 1000 files are under the real folder, so if I run the command below, I get a listing of all 1000 files. I need to find which files under the real folder contain a particular string, hello.

bash-3.00$ hadoop fs -ls /technology/dps/real

And this is my data structure in HDFS:

row format delimited 
fields terminated by '\29'
collection items terminated by ','
map keys terminated by ':'
stored as textfile

How can I write a MapReduce job for this particular problem, so that I can find which files contain a particular string? Any simple example would be of great help to me.

Update:

Using grep in Unix I can solve the above problem, but it is very slow and takes a long time to produce the actual output:

hadoop fs -ls /technology/dps/real | awk '{print $8}' | while read f; do hadoop fs -cat $f | grep cec7051a1380a47a4497a107fecb84c1 >/dev/null && echo $f; done

That is the reason I was looking for a MapReduce job to solve this kind of problem...

3 Solutions

#1

It sounds like you're looking for a grep-like program, which is easy to implement using Hadoop Streaming (the Hadoop Java API would work too):

First, write a mapper that outputs the name of the file being processed if the line being processed contains your search string. I used Python, but any language would work:

#!/usr/bin/env python
import os
import sys

# The search string is passed in through the environment (via -cmdenv).
SEARCH_STRING = os.environ["SEARCH_STRING"]

for line in sys.stdin:
    # Emit the path of the file being processed whenever a
    # whitespace-separated token of the line matches the search string.
    if SEARCH_STRING in line.split():
        print(os.environ["map_input_file"])

This code reads the search string from the SEARCH_STRING environment variable. Here, I split the input line and check whether the search string matches any of the splits; you could change this to perform a substring search or use regular expressions to check for matches.

Next, run a Hadoop streaming job using this mapper and no reducers:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -input hdfs:///data \
    -mapper search.py \
    -file search.py \
    -output /search_results \
    -cmdenv SEARCH_STRING="Apache"

The output will be written as several part files; to obtain the list of matches, you can simply cat them (provided they aren't too big):

$ bin/hadoop fs -cat /search_results/part-*
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/ivy.xml   
hdfs://localhost/data/README.txt
... 
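Note that a file is listed once per matching line (CHANGES.txt above matched twice); to list each file only once, you can pipe the result through sort -u, or add a reducer that emits each filename only once.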

#2

To get the filename you are currently processing, do:

((FileSplit) context.getInputSplit()).getPath().getName() 

As you search the file record by record, emit the above path (and perhaps the line, or anything else you need) whenever you see hello.

Set the number of reducers to 0; they aren't doing anything here.
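
A minimal sketch of such a mapper might look like this (using the new org.apache.hadoop.mapreduce API; the class name GrepMapper and the hard-coded search string are illustrative, not from the original answer):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Illustrative: in a real job you would read this from the configuration.
    private static final String SEARCH_STRING = "hello";

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains(SEARCH_STRING)) {
            // Emit the name of the file this record came from.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(fileName), NullWritable.get());
        }
    }
}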


Does 'row format delimited' mean that lines are delimited by a newline? If so, TextInputFormat and LineRecordReader will work fine here.
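
In that case, a driver for the mapper sketched above could be wired up roughly like this (again a sketch; the job name and the input/output paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GrepDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "grep-files");
        job.setJarByClass(GrepDriver.class);
        job.setMapperClass(GrepMapper.class);
        job.setNumReduceTasks(0);                       // map-only job
        job.setInputFormatClass(TextInputFormat.class); // newline-delimited records
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path("/technology/dps/real"));
        FileOutputFormat.setOutputPath(job, new Path("/search_results"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}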

#3

You can try something like this, though I'm not sure it's an efficient way to go about it. Let me know if it works; I haven't tested it or anything.

You can use it like this: java SearchFiles /technology/dps/real hello (making sure, of course, that you run it from the appropriate directory).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

// Recursively searches a directory tree on the local file system and
// prints the paths of all files whose contents include the search string.
public class SearchFiles {

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: [search-dir] [search-string]");
            return;
        }
        File searchDir = new File(args[0]);
        String searchString = args[1];
        ArrayList<File> matches = checkFiles(searchDir.listFiles(), searchString, new ArrayList<File>());
        System.out.println("These files contain '" + searchString + "':");
        for (File file : matches) {
            System.out.println(file.getPath());
        }
    }

    // Walks the file tree, descending into subdirectories and collecting
    // every file that contains the search string.
    private static ArrayList<File> checkFiles(File[] files, String search, ArrayList<File> acc) throws IOException {
        if (files == null) {
            return acc; // listFiles() returns null for a non-directory or unreadable path
        }
        for (File file : files) {
            if (file.isDirectory()) {
                checkFiles(file.listFiles(), search, acc);
            } else if (fileContainsString(file, search)) {
                acc.add(file);
            }
        }
        return acc;
    }

    // Returns true as soon as any line of the file contains the search string.
    private static boolean fileContainsString(File file, String search) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(search)) {
                    return true;
                }
            }
            return false;
        } finally {
            in.close(); // close the reader even if readLine() throws
        }
    }
}
