I have just begun learning regular expressions and Hadoop MapReduce. I am trying to run the Hadoop MapReduce example application called "grep", and I would like to find a series of words such as "and", "is", "are", and "the" in a text input file. One of the application's input arguments is the regular expression that defines the words you want to find. Say I want to search for the words "and", "is", "are", and "the". Could anyone give me an example of how to write the regular expression to pass as that argument to grep?
Thanks.
2 Answers
#1
1
The usage of Grep is:
hadoop org.apache.hadoop.examples.Grep <indir> <outdir> <regex>
So you could start off with something as simple as:
hadoop org.apache.hadoop.examples.Grep <indir> <outdir> '(and)|(is)|(are)|(the)'
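If you want to try a pattern before submitting a job, a small plain-Java check can help. This is only a sketch, not Hadoop code: it assumes the Grep example interprets the pattern with java.util.regex and reports how often each matched string occurs. The class name GrepCountSketch and the method countMatches are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepCountSketch {
    // Hypothetical helper: count how many times each matched string occurs in the text.
    static Map<String, Integer> countMatches(String regex, String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // m.group() is the exact substring matched by the whole pattern
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "the cat and the dog are here, and this is fine";
        // Same pattern as in the command line above
        System.out.println(countMatches("(and)|(is)|(are)|(the)", text));
    }
}
```

Note that "is" is counted twice here, because the pattern also matches the "is" inside "this".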
#2
0
Your regular expression should be:
"\b(and|is|are|the)\b"
Put that as your regex argument.
You can add more words to find by separating them with |, which means "or".
"\b" matches a word boundary. Without \b, the pattern could match a word inside another word, for example the "are" inside "scared".
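To see the effect of \b, here is a minimal sketch in plain Java, assuming the regex is handled with java.util.regex syntax. The class name WordBoundaryDemo and the method findAll are hypothetical. Inside a Java string literal the backslash is doubled; on the command line you would pass \b as-is inside single quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Hypothetical helper: collect every substring of text that the regex matches.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "this is scared and the others are";
        // Without \b: also matches inside "this", "scared" and "others"
        System.out.println(findAll("(and|is|are|the)", text));
        // With \b: only whole words match
        System.out.println(findAll("\\b(and|is|are|the)\\b", text));
    }
}
```

The first call also picks up the "is" in "this", the "are" in "scared", and the "the" in "others"; the second returns only the standalone words.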