I am writing a Korn shell script. I have two arrays (say, arr1
and arr2
), both containing strings, and I need to check which elements from arr1
are present (as whole strings or substrings) in arr2
. The most intuitive solution is having nested for loops, and checking if each element from arr1
can be found in arr2
(through grep
) like this:
我正在写一个Korn shell脚本。我有两个数组(比如arr1和arr2),两个都包含字符串,我需要检查arr2中哪些元素来自arr1(作为整个字符串或子字符串)。最直观的解决方案是嵌套for循环,并检查是否可以在arr2(通过grep)中找到arr1中的每个元素,如下所示:
for arr1Element in ${arr1[*]}; do
for arr2Element in ${arr2[*]}; do
# using grep to check if arr1Element is present in arr2Element
echo $arr2Element | grep $arr1Element
done
done
The issue is that arr2
has around 3000 elements, so running a nested loop takes a long time. I am wondering if there is a better way to do this in Bash.
问题是arr2有大约3000个元素,因此运行嵌套循环需要很长时间。我想知道在Bash中是否有更好的方法可以做到这一点。
If I were doing this in Java, I could have calculated hashes for elements in one of the arrays, and then looked for those hashes in the other array, but I don't think Bash has any functionality for doing something like this (unless I was willing to write a hash calculating function in Bash).
如果我在Java中执行此操作,我可以计算其中一个数组中的元素的哈希值,然后在另一个数组中查找这些哈希值,但我不认为Bash有任何功能可以执行此类操作(除非我愿意在Bash中编写哈希计算函数。
Any suggestions?
有什么建议么?
5 个解决方案
#1
3
Since version 4.0 Bash has associative arrays:
从版本4.0开始,Bash具有关联数组:
$ declare -A elements
$ elements[hello]=world
$ echo ${elements[hello]}
world
You can use this in the same way you would a Java Map.
您可以像使用Java Map一样使用它。
declare -A map
for el in "${arr1[@]}"; do
map[$el]="x"
done
for el in "${arr2[@]}"; do
if [ -n "${map[$el]}" ] ; then
echo "${el}"
fi
done
Dealing with substrings is an altogether more weighty problem, and would be a challenge in any language, short of the brute-force algorithm you're already using. You could build a binary-tree index of character sequences, but I wouldn't try that in Bash!
处理子串是一个更加重要的问题,并且在任何语言中都是一个挑战,缺少你已经使用的蛮力算法。您可以构建字符序列的二叉树索引,但我不会在Bash中尝试!
#2
2
BashFAQ #36 describes doing set arithmetic (unions, disjoint sets, etc) in bash with comm
.
BashFAQ#36描述了使用comm在bash中进行集算术(联合,不相交集等)。
Assuming your values can't contain literal newlines, the following will emit a line per item in both arr1 and arr2:
假设您的值不能包含文字换行符,则以下内容将在arr1和arr2中为每个项目发出一行:
comm -12 <(printf '%s\n' "${arr1[@]}" | sort -u) \
<(printf '%s\n' "${arr2[@]}" | sort -u)
If your arrays are pre-sorted, you can remove the sort
s (which will make this extremely memory- and time-efficient with large arrays, moreso than the grep
-based approach).
如果您的数组已预先排序,您可以删除排序(这将使大型阵列的内存和时间效率极高,比基于grep的方法更多)。
#3
2
Since you're OK with using grep
, and since you want to match substrings as well as full strings, one approach is to write:
由于你可以使用grep,并且因为你想匹配子串和完整的字符串,一种方法是写:
printf '%s\n' "${arr2[@]}" \
| grep -o -F "$(printf '%s\n' "${arr1[@]}")
and let grep
optimize as it sees fit.
并让grep根据需要进行优化。
#4
0
Here's a bash/awk
idea:
这是一个bash / awk的想法:
# some sample arrays
$ arr1=( my first string "hello wolrd")
$ arr2=( my last stringbean strings "well, hello world!)
# break array elements into separate lines
$ printf '%s\n' "${arr1[@]}"
my
first
string
hello world
$ printf '%s\n' "${arr2[@]}"
my
last
stringbean
strings
well, hello world!
# use the 'printf' command output as input to our awk command
$ awk '
NR==FNR { a[NR]=$0 ; next }
{ for (i in a)
if ($0 ~ a[i]) print "array1 string {"a[i]"} is a substring of array2 string {"$0"}" }
' <( printf '%s\n' "${arr1[@]}" ) \
<( printf '%s\n' "${arr2[@]}" )
array1 string {my} is a substring of array2 string {my}
array1 string {string} is a substring of array2 string {stringbean}
array1 string {string} is a substring of array2 string {strings}
array1 string {hello world} is a substring of array2 string {well, hello world!}
-
NR==FNR
: for file #1 only: store elements into awk array named 'a' - NR == FNR:仅用于文件#1:将元素存储到名为'a'的awk数组中
-
next
: process next line in file #1; at this point rest of awk script is ignored for file #1; the for each line in file #2 ... - next:处理文件#1中的下一行;此时,文件#1会忽略其余的awk脚本;对于文件#2中的每一行...
-
for (i in a)
: for each index 'i' in array 'a' ... - for(i in a):对于数组'a'中的每个索引'i'...
-
if ($0 ~ a[i] )
: see if a[i] is a substring of the current line ($0) from file #2 and if so ... - if($ 0~a [i]):查看a [i]是否是文件#2中当前行($ 0)的子字符串,如果是的话......
-
print "array1...
: output info about the match - print“array1 ...:输出有关匹配的信息
A test run using the following data:
使用以下数据进行测试运行:
arr1 == 3300 elements
arr2 == 500 elements
When all arr2
elements have a substring/pattern match in arr1
(ie, 500 matches), total time to run is ~27 seconds ... so the repetitive looping through the array takes a toll.
当所有arr2元素在arr1中都有一个子串/模式匹配(即500个匹配)时,总运行时间约为27秒......因此重复循环遍历数组需要付费。
Obviously (?) need to reduce the volume of repetitive actions ...
显然(?)需要减少重复动作的量......
- for an exact string match the
comm
solution by Charles Duffy makes sense (it runs against the same 3300/500 test set in about 0.5 seconds) - 对于精确的字符串匹配,Charles Duffy的comm解决方案是有意义的(它在大约0.5秒内针对相同的3300/500测试集运行)
- for a substring/pattern match I was able to get a
egrep
solution to run in about 5 seconds (see my other answer/post) - 对于子串/模式匹配,我能够获得一个egrep解决方案在大约5秒钟内运行(参见我的其他答案/帖子)
#5
0
An egrep
solution for substring/pattern matching ...
用于子串/模式匹配的egrep解决方案......
egrep -f <(printf '.*%s.*\n' "${arr1[@]}") \
<(printf '%s\n' "${arr2[@]}")
-
egrep -f
: take patterns to search from the file designated by the-f
, which in this case is ... - egrep -f:从-f指定的文件中搜索模式,在这种情况下是...
-
<(printf '.*%s.*\n' "${arr1[@]}")
: convertarr1
elements into 1 pattern per line, appending a regex wild card character (.*) for prefix and suffix - <(printf'。*%s。* \ n'“$ {arr1 [@]}”):将arr1元素转换为每行1个模式,为前缀和后缀附加正则表达式通配符(。*)
-
<(printf '%s\n' "${arr2[@]}")
: convertarr2
elements into 1 string per line - <(printf'%s \ n'“$ {arr2 [@]}”):将arr2元素转换为每行1个字符串
When run against a sample data set like:
针对以下示例数据集运行时:
arr1 == 3300 elements
arr2 == 500 elements
... with 500 matches, total run time is ~5 seconds; there's still a good bit of repetitive processing going on with egrep
but not as bad as seen with my other answer (bash/awk
) ... and of course not as fast the comm
solution which eliminates the repetitive processing.
...有500场比赛,总运行时间约为5秒;还有一些重复处理正在进行egrep,但没有看到我的其他答案(bash / awk)那么糟糕...当然没有那么快的通信解决方案,消除了重复处理。
#1
3
Since version 4.0 Bash has associative arrays:
从版本4.0开始,Bash具有关联数组:
$ declare -A elements
$ elements[hello]=world
$ echo ${elements[hello]}
world
You can use this in the same way you would a Java Map.
您可以像使用Java Map一样使用它。
declare -A map
for el in "${arr1[@]}"; do
map[$el]="x"
done
for el in "${arr2[@]}"; do
if [ -n "${map[$el]}" ] ; then
echo "${el}"
fi
done
Dealing with substrings is an altogether more weighty problem, and would be a challenge in any language, short of the brute-force algorithm you're already using. You could build a binary-tree index of character sequences, but I wouldn't try that in Bash!
处理子串是一个更加重要的问题,并且在任何语言中都是一个挑战,缺少你已经使用的蛮力算法。您可以构建字符序列的二叉树索引,但我不会在Bash中尝试!
#2
2
BashFAQ #36 describes doing set arithmetic (unions, disjoint sets, etc) in bash with comm
.
BashFAQ#36描述了使用comm在bash中进行集算术(联合,不相交集等)。
Assuming your values can't contain literal newlines, the following will emit a line per item in both arr1 and arr2:
假设您的值不能包含文字换行符,则以下内容将在arr1和arr2中为每个项目发出一行:
comm -12 <(printf '%s\n' "${arr1[@]}" | sort -u) \
<(printf '%s\n' "${arr2[@]}" | sort -u)
If your arrays are pre-sorted, you can remove the sort
s (which will make this extremely memory- and time-efficient with large arrays, moreso than the grep
-based approach).
如果您的数组已预先排序,您可以删除排序(这将使大型阵列的内存和时间效率极高,比基于grep的方法更多)。
#3
2
Since you're OK with using grep
, and since you want to match substrings as well as full strings, one approach is to write:
由于你可以使用grep,并且因为你想匹配子串和完整的字符串,一种方法是写:
printf '%s\n' "${arr2[@]}" \
| grep -o -F "$(printf '%s\n' "${arr1[@]}")
and let grep
optimize as it sees fit.
并让grep根据需要进行优化。
#4
0
Here's a bash/awk
idea:
这是一个bash / awk的想法:
# some sample arrays
$ arr1=( my first string "hello wolrd")
$ arr2=( my last stringbean strings "well, hello world!)
# break array elements into separate lines
$ printf '%s\n' "${arr1[@]}"
my
first
string
hello world
$ printf '%s\n' "${arr2[@]}"
my
last
stringbean
strings
well, hello world!
# use the 'printf' command output as input to our awk command
$ awk '
NR==FNR { a[NR]=$0 ; next }
{ for (i in a)
if ($0 ~ a[i]) print "array1 string {"a[i]"} is a substring of array2 string {"$0"}" }
' <( printf '%s\n' "${arr1[@]}" ) \
<( printf '%s\n' "${arr2[@]}" )
array1 string {my} is a substring of array2 string {my}
array1 string {string} is a substring of array2 string {stringbean}
array1 string {string} is a substring of array2 string {strings}
array1 string {hello world} is a substring of array2 string {well, hello world!}
-
NR==FNR
: for file #1 only: store elements into awk array named 'a' - NR == FNR:仅用于文件#1:将元素存储到名为'a'的awk数组中
-
next
: process next line in file #1; at this point rest of awk script is ignored for file #1; the for each line in file #2 ... - next:处理文件#1中的下一行;此时,文件#1会忽略其余的awk脚本;对于文件#2中的每一行...
-
for (i in a)
: for each index 'i' in array 'a' ... - for(i in a):对于数组'a'中的每个索引'i'...
-
if ($0 ~ a[i] )
: see if a[i] is a substring of the current line ($0) from file #2 and if so ... - if($ 0~a [i]):查看a [i]是否是文件#2中当前行($ 0)的子字符串,如果是的话......
-
print "array1...
: output info about the match - print“array1 ...:输出有关匹配的信息
A test run using the following data:
使用以下数据进行测试运行:
arr1 == 3300 elements
arr2 == 500 elements
When all arr2
elements have a substring/pattern match in arr1
(ie, 500 matches), total time to run is ~27 seconds ... so the repetitive looping through the array takes a toll.
当所有arr2元素在arr1中都有一个子串/模式匹配(即500个匹配)时,总运行时间约为27秒......因此重复循环遍历数组需要付费。
Obviously (?) need to reduce the volume of repetitive actions ...
显然(?)需要减少重复动作的量......
- for an exact string match the
comm
solution by Charles Duffy makes sense (it runs against the same 3300/500 test set in about 0.5 seconds) - 对于精确的字符串匹配,Charles Duffy的comm解决方案是有意义的(它在大约0.5秒内针对相同的3300/500测试集运行)
- for a substring/pattern match I was able to get a
egrep
solution to run in about 5 seconds (see my other answer/post) - 对于子串/模式匹配,我能够获得一个egrep解决方案在大约5秒钟内运行(参见我的其他答案/帖子)
#5
0
An egrep
solution for substring/pattern matching ...
用于子串/模式匹配的egrep解决方案......
egrep -f <(printf '.*%s.*\n' "${arr1[@]}") \
<(printf '%s\n' "${arr2[@]}")
-
egrep -f
: take patterns to search from the file designated by the-f
, which in this case is ... - egrep -f:从-f指定的文件中搜索模式,在这种情况下是...
-
<(printf '.*%s.*\n' "${arr1[@]}")
: convertarr1
elements into 1 pattern per line, appending a regex wild card character (.*) for prefix and suffix - <(printf'。*%s。* \ n'“$ {arr1 [@]}”):将arr1元素转换为每行1个模式,为前缀和后缀附加正则表达式通配符(。*)
-
<(printf '%s\n' "${arr2[@]}")
: convertarr2
elements into 1 string per line - <(printf'%s \ n'“$ {arr2 [@]}”):将arr2元素转换为每行1个字符串
When run against a sample data set like:
针对以下示例数据集运行时:
arr1 == 3300 elements
arr2 == 500 elements
... with 500 matches, total run time is ~5 seconds; there's still a good bit of repetitive processing going on with egrep
but not as bad as seen with my other answer (bash/awk
) ... and of course not as fast the comm
solution which eliminates the repetitive processing.
...有500场比赛,总运行时间约为5秒;还有一些重复处理正在进行egrep,但没有看到我的其他答案(bash / awk)那么糟糕...当然没有那么快的通信解决方案,消除了重复处理。