I have been polishing my grep skills on a particular problem. I have a local file containing words from a dictionary. The user passes in a word, and the script should find every word that can be made from the letters of that word. The catch is that each result must be at least 4 characters long and may use each letter only as many times as it appears in the word the user passed in. So if I passed in a word like "College", clee and cell would be acceptable, but cocco would not: it contains only letters from the word, but college has just 1 o and 1 c. Here is my regular expression so far.
egrep -i "^[("$text")]{4,}$" /usr/dict/words
This finds strings of at least four characters that are made up of these letters; however, grep seems to be greedy and accepts more copies of a letter than the variable contains. How would I restrict a match to only as many of each letter as there are in the variable? I've been stuck on this for a few days now, to no avail. Thank you for your help and time, community!
1 solution
#1
To expand on what @chepner said in the comments, a bracket expression does not count how many times a character is listed in it. In other words, [ee] will not match 2 e's; it matches a single character if that character is an e, so [ee] is redundant with [e]. A bracket expression on its own matches exactly one character; [e]+ would match a run of at least 1 e, for as long as the string allows. To match a specific number of e's you'd have to know that number beforehand and write something like [e]{2,5}, which matches at least 2 but no more than 5 consecutive e's.
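A quick way to check this yourself is to count the matches grep reports. This is only a sketch: it assumes a grep that supports -E and -o (GNU and BSD grep both do), and count_matches is a made-up helper name.

```shell
# count_matches PATTERN TEXT: print how many separate matches grep -o finds.
# (count_matches is a hypothetical helper, not part of grep.)
count_matches() {
    printf '%s\n' "$2" | grep -Eo "$1" | wc -l | tr -d ' '
}

count_matches '[ee]' 'college'   # prints 2: one match for each e in the word
count_matches '[e]'  'college'   # prints 2: listing e twice changed nothing
count_matches 'e{2}' 'college'   # prints 0: the two e's are not consecutive
```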
Even if you added a preprocessing step to count how often each letter occurs in the input, you'd have a hard time getting the regular expression to match the way you expect. To go with your example of "college", the preprocessed counts would be c=1, o=1, l=2, e=2, g=1. If you put those into a regular expression like ^c?o?l{0,2}e{0,2}g?$ [note that "?" in this context is shorthand for {0,1}], it would not even match "college", because the match is positional: the letters have to appear in exactly that order. It would match "colleg", "colleeg", etc.
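To see the positional behaviour, you could run a few words through that pattern. A minimal sketch, assuming grep -E is available; matches is a hypothetical helper name.

```shell
pattern='^c?o?l{0,2}e{0,2}g?$'

# matches WORD: succeed when the whole word fits the positional pattern.
matches() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

for word in college colleg colleeg clee cell; do
    if matches "$word"; then
        echo "$word: match"
    else
        echo "$word: no match"
    fi
done
```

Note that "cell" is rejected even though it only uses letters from "college", because its e comes before its l's.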
As for verifying the length of the string: what you have only verifies that there are at least four letters from the range []. You may want to change it to grep -E "^.{4,}$" to check whether the entire line is at least 4 characters long (plain grep needs the braces escaped, as in "^.\{4,\}$").
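As a small illustration of the length check on its own (the three sample words are made up for the example), both spellings of the pattern should behave the same:

```shell
# Extended syntax (grep -E / egrep): braces are written bare.
printf '%s\n' cat cell college | grep -E '^.{4,}$'

# Basic syntax (plain grep): braces must be escaped.
printf '%s\n' cat cell college | grep '^.\{4,\}$'
```

Each command prints cell and college and drops cat.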
If you aren't limited to using only grep, but are limited to bash, you may be able to use the script below to solve your problem:
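read -r input
while read -r line
do
    testVal=""    # reset, so a short word cannot reuse the previous word's result
    # only consider words that are at least 4 characters long
    if echo "$line" | grep -q '^.\{4,\}$'
    then
        testVal=$line
        # for each letter of the input, blank out one matching letter of the word
        # (the "i" flag for case-insensitive matching is a GNU sed extension)
        for i in $(echo "$input" | sed -e 's/\(.\)/\1 /g')
        do
            testVal=$(echo "$testVal" | sed -e "s/$i/_/i")
        done
    fi
    # if only underscores remain, every letter of the word was covered by the input
    if echo "$testVal" | grep -q '^_\+$'
    then
        echo "$line"
    fi
done < /usr/dict/words    # the word list from the question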
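The same letter-consuming idea can also be sketched as a function that uses only shell parameter expansion, so sed and grep are not started for every word. This is a variant under my own assumptions, not the script above: can_make is a hypothetical name, and it assumes plain single-byte letters.

```shell
# can_make WORD LETTERS: succeed if WORD is at least 4 characters long and can
# be spelled using each character of LETTERS at most once (case-insensitive).
# The body runs in a subshell, so its variables do not leak to the caller.
can_make() (
    word=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    pool=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    [ "${#word}" -ge 4 ] || return 1
    while [ -n "$word" ]; do
        rest=${word#?}            # the word without its first character
        ch=${word%"$rest"}        # the first character itself
        case $pool in
            *"$ch"*)              # letter still available: use up one copy
                before=${pool%%"$ch"*}
                after=${pool#*"$ch"}
                pool=$before$after
                ;;
            *)  return 1 ;;       # letter missing or already used up
        esac
        word=$rest
    done
    return 0
)

can_make cell  College && echo "cell: ok"
can_make cocco College || echo "cocco: rejected"
```

To filter a word list with it, you could loop over the file as before and print each line for which can_make "$line" "$input" succeeds.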