有效地查找两个ksh或bash数组之间的共同元素

时间:2021-01-03 12:47:48

I am writing a Korn shell script. I have two arrays (say, arr1 and arr2), both containing strings, and I need to check which elements from arr1 are present (as whole strings or substrings) in arr2. The most intuitive solution is having nested for loops, and checking if each element from arr1 can be found in arr2 (through grep) like this:

我正在写一个Korn shell脚本。我有两个数组(比如arr1和arr2),两个都包含字符串,我需要检查arr2中哪些元素来自arr1(作为整个字符串或子字符串)。最直观的解决方案是嵌套for循环,并检查是否可以在arr2(通过grep)中找到arr1中的每个元素,如下所示:

for arr1Element in ${arr1[*]}; do
    for arr2Element in ${arr2[*]}; do
        # using grep to check if arr1Element is present in arr2Element
        echo $arr2Element | grep $arr1Element
    done
done

The issue is that arr2 has around 3000 elements, so running a nested loop takes a long time. I am wondering if there is a better way to do this in Bash.

问题是arr2有大约3000个元素,因此运行嵌套循环需要很长时间。我想知道在Bash中是否有更好的方法可以做到这一点。

If I were doing this in Java, I could have calculated hashes for elements in one of the arrays, and then looked for those hashes in the other array, but I don't think Bash has any functionality for doing something like this (unless I was willing to write a hash calculating function in Bash).

如果我在Java中执行此操作,我可以计算其中一个数组中的元素的哈希值,然后在另一个数组中查找这些哈希值,但我不认为Bash有任何功能可以执行此类操作(除非我愿意在Bash中编写哈希计算函数。

Any suggestions?

有什么建议么?

5 个解决方案

#1


3  

Since version 4.0 Bash has associative arrays:

从版本4.0开始,Bash具有关联数组:

$ declare -A elements
$ elements[hello]=world
$ echo ${elements[hello]}
world

You can use this in the same way you would a Java Map.

您可以像使用Java Map一样使用它。

declare -A map
for el in "${arr1[@]}"; do 
    map[$el]="x"
done

for el in "${arr2[@]}"; do 
    if [ -n "${map[$el]}" ] ; then 
       echo "${el}"
    fi
done

Dealing with substrings is an altogether more weighty problem, and would be a challenge in any language, short of the brute-force algorithm you're already using. You could build a binary-tree index of character sequences, but I wouldn't try that in Bash!

处理子串是一个更加重要的问题,并且在任何语言中都是一个挑战,缺少你已经使用的蛮力算法。您可以构建字符序列的二叉树索引,但我不会在Bash中尝试!

#2


2  

BashFAQ #36 describes doing set arithmetic (unions, disjoint sets, etc) in bash with comm.

BashFAQ#36描述了使用comm在bash中进行集算术(联合,不相交集等)。

Assuming your values can't contain literal newlines, the following will emit a line per item in both arr1 and arr2:

假设您的值不能包含文字换行符,则以下内容将在arr1和arr2中为每个项目发出一行:

comm -12 <(printf '%s\n' "${arr1[@]}" | sort -u) \
         <(printf '%s\n' "${arr2[@]}" | sort -u)

If your arrays are pre-sorted, you can remove the sorts (which will make this extremely memory- and time-efficient with large arrays, moreso than the grep-based approach).

如果您的数组已预先排序,您可以删除排序(这将使大型阵列的内存和时间效率极高,比基于grep的方法更多)。

#3


2  

Since you're OK with using grep, and since you want to match substrings as well as full strings, one approach is to write:

由于你可以使用grep,并且因为你想匹配子串和完整的字符串,一种方法是写:

printf '%s\n' "${arr2[@]}" \
  | grep -o -F "$(printf '%s\n' "${arr1[@]}")

and let grep optimize as it sees fit.

并让grep根据需要进行优化。

#4


0  

Here's a bash/awk idea:

这是一个bash / awk的想法:

# some sample arrays

$ arr1=( my first string "hello wolrd")
$ arr2=( my last  stringbean strings "well, hello world!)

# break array elements into separate lines

$ printf '%s\n' "${arr1[@]}"
my
first
string
hello world

$ printf '%s\n' "${arr2[@]}"
my
last
stringbean
strings
well, hello world!

# use the 'printf' command output as input to our awk command

$ awk '
NR==FNR { a[NR]=$0 ; next }
{ for (i in a)
      if ($0 ~ a[i]) print "array1 string {"a[i]"} is a substring of array2 string {"$0"}" }
' <( printf '%s\n' "${arr1[@]}" ) \
  <( printf '%s\n' "${arr2[@]}" )

array1 string {my} is a substring of array2 string {my}
array1 string {string} is a substring of array2 string {stringbean}
array1 string {string} is a substring of array2 string {strings}
array1 string {hello world} is a substring of array2 string {well, hello world!}
  • NR==FNR : for file #1 only: store elements into awk array named 'a'
  • NR == FNR:仅用于文件#1:将元素存储到名为'a'的awk数组中
  • next : process next line in file #1; at this point rest of awk script is ignored for file #1; the for each line in file #2 ...
  • next:处理文件#1中的下一行;此时,文件#1会忽略其余的awk脚本;对于文件#2中的每一行...
  • for (i in a) : for each index 'i' in array 'a' ...
  • for(i in a):对于数组'a'中的每个索引'i'...
  • if ($0 ~ a[i] ) : see if a[i] is a substring of the current line ($0) from file #2 and if so ...
  • if($ 0~a [i]):查看a [i]是否是文件#2中当前行($ 0)的子字符串,如果是的话......
  • print "array1... : output info about the match
  • print“array1 ...:输出有关匹配的信息

A test run using the following data:

使用以下数据进行测试运行:

arr1 == 3300 elements
arr2 ==  500 elements

When all arr2 elements have a substring/pattern match in arr1 (ie, 500 matches), total time to run is ~27 seconds ... so the repetitive looping through the array takes a toll.

当所有arr2元素在arr1中都有一个子串/模式匹配(即500个匹配)时,总运行时间约为27秒......因此重复循环遍历数组需要付费。

Obviously (?) need to reduce the volume of repetitive actions ...

显然(?)需要减少重复动作的量......

  • for an exact string match the comm solution by Charles Duffy makes sense (it runs against the same 3300/500 test set in about 0.5 seconds)
  • 对于精确的字符串匹配,Charles Duffy的comm解决方案是有意义的(它在大约0.5秒内针对相同的3300/500测试集运行)
  • for a substring/pattern match I was able to get a egrep solution to run in about 5 seconds (see my other answer/post)
  • 对于子串/模式匹配,我能够获得一个egrep解决方案在大约5秒钟内运行(参见我的其他答案/帖子)

#5


0  

An egrep solution for substring/pattern matching ...

用于子串/模式匹配的egrep解决方案......

egrep -f <(printf '.*%s.*\n' "${arr1[@]}") \
         <(printf '%s\n'     "${arr2[@]}")
  • egrep -f : take patterns to search from the file designated by the -f, which in this case is ...
  • egrep -f:从-f指定的文件中搜索模式,在这种情况下是...
  • <(printf '.*%s.*\n' "${arr1[@]}") : convert arr1 elements into 1 pattern per line, appending a regex wild card character (.*) for prefix and suffix
  • <(printf'。*%s。* \ n'“$ {arr1 [@]}”):将arr1元素转换为每行1个模式,为前缀和后缀附加正则表达式通配符(。*)
  • <(printf '%s\n' "${arr2[@]}") : convert arr2 elements into 1 string per line
  • <(printf'%s \ n'“$ {arr2 [@]}”):将arr2元素转换为每行1个字符串

When run against a sample data set like:

针对以下示例数据集运行时:

arr1 == 3300 elements
arr2 ==  500 elements

... with 500 matches, total run time is ~5 seconds; there's still a good bit of repetitive processing going on with egrep but not as bad as seen with my other answer (bash/awk) ... and of course not as fast the comm solution which eliminates the repetitive processing.

...有500场比赛,总运行时间约为5秒;还有一些重复处理正在进行egrep,但没有看到我的其他答案(bash / awk)那么糟糕...当然没有那么快的通信解决方案,消除了重复处理。

#1


3  

Since version 4.0 Bash has associative arrays:

从版本4.0开始,Bash具有关联数组:

$ declare -A elements
$ elements[hello]=world
$ echo ${elements[hello]}
world

You can use this in the same way you would a Java Map.

您可以像使用Java Map一样使用它。

declare -A map
for el in "${arr1[@]}"; do 
    map[$el]="x"
done

for el in "${arr2[@]}"; do 
    if [ -n "${map[$el]}" ] ; then 
       echo "${el}"
    fi
done

Dealing with substrings is an altogether more weighty problem, and would be a challenge in any language, short of the brute-force algorithm you're already using. You could build a binary-tree index of character sequences, but I wouldn't try that in Bash!

处理子串是一个更加重要的问题,并且在任何语言中都是一个挑战,缺少你已经使用的蛮力算法。您可以构建字符序列的二叉树索引,但我不会在Bash中尝试!

#2


2  

BashFAQ #36 describes doing set arithmetic (unions, disjoint sets, etc) in bash with comm.

BashFAQ#36描述了使用comm在bash中进行集算术(联合,不相交集等)。

Assuming your values can't contain literal newlines, the following will emit a line per item in both arr1 and arr2:

假设您的值不能包含文字换行符,则以下内容将在arr1和arr2中为每个项目发出一行:

comm -12 <(printf '%s\n' "${arr1[@]}" | sort -u) \
         <(printf '%s\n' "${arr2[@]}" | sort -u)

If your arrays are pre-sorted, you can remove the sorts (which will make this extremely memory- and time-efficient with large arrays, moreso than the grep-based approach).

如果您的数组已预先排序,您可以删除排序(这将使大型阵列的内存和时间效率极高,比基于grep的方法更多)。

#3


2  

Since you're OK with using grep, and since you want to match substrings as well as full strings, one approach is to write:

由于你可以使用grep,并且因为你想匹配子串和完整的字符串,一种方法是写:

printf '%s\n' "${arr2[@]}" \
  | grep -o -F "$(printf '%s\n' "${arr1[@]}")

and let grep optimize as it sees fit.

并让grep根据需要进行优化。

#4


0  

Here's a bash/awk idea:

这是一个bash / awk的想法:

# some sample arrays

$ arr1=( my first string "hello wolrd")
$ arr2=( my last  stringbean strings "well, hello world!)

# break array elements into separate lines

$ printf '%s\n' "${arr1[@]}"
my
first
string
hello world

$ printf '%s\n' "${arr2[@]}"
my
last
stringbean
strings
well, hello world!

# use the 'printf' command output as input to our awk command

$ awk '
NR==FNR { a[NR]=$0 ; next }
{ for (i in a)
      if ($0 ~ a[i]) print "array1 string {"a[i]"} is a substring of array2 string {"$0"}" }
' <( printf '%s\n' "${arr1[@]}" ) \
  <( printf '%s\n' "${arr2[@]}" )

array1 string {my} is a substring of array2 string {my}
array1 string {string} is a substring of array2 string {stringbean}
array1 string {string} is a substring of array2 string {strings}
array1 string {hello world} is a substring of array2 string {well, hello world!}
  • NR==FNR : for file #1 only: store elements into awk array named 'a'
  • NR == FNR:仅用于文件#1:将元素存储到名为'a'的awk数组中
  • next : process next line in file #1; at this point rest of awk script is ignored for file #1; the for each line in file #2 ...
  • next:处理文件#1中的下一行;此时,文件#1会忽略其余的awk脚本;对于文件#2中的每一行...
  • for (i in a) : for each index 'i' in array 'a' ...
  • for(i in a):对于数组'a'中的每个索引'i'...
  • if ($0 ~ a[i] ) : see if a[i] is a substring of the current line ($0) from file #2 and if so ...
  • if($ 0~a [i]):查看a [i]是否是文件#2中当前行($ 0)的子字符串,如果是的话......
  • print "array1... : output info about the match
  • print“array1 ...:输出有关匹配的信息

A test run using the following data:

使用以下数据进行测试运行:

arr1 == 3300 elements
arr2 ==  500 elements

When all arr2 elements have a substring/pattern match in arr1 (ie, 500 matches), total time to run is ~27 seconds ... so the repetitive looping through the array takes a toll.

当所有arr2元素在arr1中都有一个子串/模式匹配(即500个匹配)时,总运行时间约为27秒......因此重复循环遍历数组需要付费。

Obviously (?) need to reduce the volume of repetitive actions ...

显然(?)需要减少重复动作的量......

  • for an exact string match the comm solution by Charles Duffy makes sense (it runs against the same 3300/500 test set in about 0.5 seconds)
  • 对于精确的字符串匹配,Charles Duffy的comm解决方案是有意义的(它在大约0.5秒内针对相同的3300/500测试集运行)
  • for a substring/pattern match I was able to get a egrep solution to run in about 5 seconds (see my other answer/post)
  • 对于子串/模式匹配,我能够获得一个egrep解决方案在大约5秒钟内运行(参见我的其他答案/帖子)

#5


0  

An egrep solution for substring/pattern matching ...

用于子串/模式匹配的egrep解决方案......

egrep -f <(printf '.*%s.*\n' "${arr1[@]}") \
         <(printf '%s\n'     "${arr2[@]}")
  • egrep -f : take patterns to search from the file designated by the -f, which in this case is ...
  • egrep -f:从-f指定的文件中搜索模式,在这种情况下是...
  • <(printf '.*%s.*\n' "${arr1[@]}") : convert arr1 elements into 1 pattern per line, appending a regex wild card character (.*) for prefix and suffix
  • <(printf'。*%s。* \ n'“$ {arr1 [@]}”):将arr1元素转换为每行1个模式,为前缀和后缀附加正则表达式通配符(。*)
  • <(printf '%s\n' "${arr2[@]}") : convert arr2 elements into 1 string per line
  • <(printf'%s \ n'“$ {arr2 [@]}”):将arr2元素转换为每行1个字符串

When run against a sample data set like:

针对以下示例数据集运行时:

arr1 == 3300 elements
arr2 ==  500 elements

... with 500 matches, total run time is ~5 seconds; there's still a good bit of repetitive processing going on with egrep but not as bad as seen with my other answer (bash/awk) ... and of course not as fast the comm solution which eliminates the repetitive processing.

...有500场比赛,总运行时间约为5秒;还有一些重复处理正在进行egrep,但没有看到我的其他答案(bash / awk)那么糟糕...当然没有那么快的通信解决方案,消除了重复处理。