Filter a data frame value by value using the values of a second data frame

Time: 2022-09-27 22:56:07

I have always had trouble with loops, so I am asking here. I have two data frames: one very large and one much smaller. Sample versions are below.

Dataframe 1

ID        Value
1         apples
1         apples
1         bananas
1         grapes
1         mangoes
1         oranges
1         grapes
1         apples
1         grapes
2         apples
2         apples
2         passionfruits
2         bananas
2         apples
2         apples
2         passionfruits
2         grapes
2         mangoes
2         apples
3         apples
3         bananas
3         oranges
3         apples
3         grapes
3         grapes
3         passionfruits
3         passionfruits
3         oranges
4         apples
4         oranges
4         mangoes
4         bananas
4         grapes
4         grapes
4         grapes
4         apples
4         oranges
4         grapes
4         mangoes
4         mangoes
4         apples
4         oranges
5         passionfruits
5         apples
5         oranges
5         oranges
5         mangoes
5         grapes
5         apples
5         bananas

Dataframe 2

Value
apples
apples
bananas
grapes
mangoes
mangoes
grapes
apples
apples
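
For reproducibility, the sample data above can be rebuilt as follows. This is only a sketch: the names `df1` and `df2` are assumed because two of the answers use them, and the columns are kept as plain character vectors rather than factors.

```r
# Sample data from the question, rebuilt as data frames.
# The object names df1 and df2 are an assumption taken from the answers.
df1 <- data.frame(
  ID = rep(1:5, times = c(9, 10, 9, 14, 8)),
  Value = c(
    # ID 1
    "apples", "apples", "bananas", "grapes", "mangoes",
    "oranges", "grapes", "apples", "grapes",
    # ID 2
    "apples", "apples", "passionfruits", "bananas", "apples",
    "apples", "passionfruits", "grapes", "mangoes", "apples",
    # ID 3
    "apples", "bananas", "oranges", "apples", "grapes",
    "grapes", "passionfruits", "passionfruits", "oranges",
    # ID 4
    "apples", "oranges", "mangoes", "bananas", "grapes", "grapes", "grapes",
    "apples", "oranges", "grapes", "mangoes", "mangoes", "apples", "oranges",
    # ID 5
    "passionfruits", "apples", "oranges", "oranges",
    "mangoes", "grapes", "apples", "bananas"
  ),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  Value = c("apples", "apples", "bananas", "grapes", "mangoes",
            "mangoes", "grapes", "apples", "apples"),
  stringsAsFactors = FALSE
)
```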

The different IDs in dataframe 1 are treated as sets. Dataframe 2 in its entirety will be an approximate or exact match to one of those sets. I know there is plenty of code for filtering dataframe 1 against the whole of dataframe 2, but that is not what I need. I need to filter sequentially, value by value, with a condition attached: whether the previous value also matched.

So in this example, nothing happens with the first value because every ID contains 'apples'. The second value, 'apples', given that the previous value was also 'apples', filters out ID 4 because that set doesn't contain 'apples' twice in a row (the same holds for IDs 3 and 5). We then search the filtered dataframe 1 for the third value, and so on. The process stops only when a single ID set remains in dataframe 1, which in this case is after the 3rd iteration. The result should be:

Dataframe 1

ID        Value
1         apples
1         apples
1         bananas
1         grapes
1         mangoes
1         oranges
1         grapes
1         apples
1         grapes
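
The value-by-value procedure described above could be sketched roughly like this in base R. The helper names `contains_run` and `sequential_filter` are hypothetical, the toy data stands in for the real data frames, and the rule "keep the previous survivors when no ID matches any longer" is an assumption made to cover the approximate-match case.

```r
# Does `pattern` occur as a contiguous run somewhere inside `x`?
contains_run <- function(x, pattern) {
  n <- length(pattern)
  if (n > length(x)) return(FALSE)
  starts <- seq_len(length(x) - n + 1)
  any(vapply(starts,
             function(i) all(x[i:(i + n - 1)] == pattern),
             logical(1)))
}

# Grow the pattern one value at a time and drop IDs that no longer match.
sequential_filter <- function(df, pattern) {
  ids <- unique(df$ID)
  for (k in seq_along(pattern)) {
    hit <- vapply(ids,
                  function(id) contains_run(df$Value[df$ID == id],
                                            pattern[seq_len(k)]),
                  logical(1))
    if (!any(hit)) break          # assumption: keep the previous survivors
    ids <- ids[hit]
    if (length(ids) == 1) break   # stop as soon as a single ID remains
  }
  df[df$ID %in% ids, ]
}

# Toy data: three ID sets of three values each
toy <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Value = c("a", "a", "b", "a", "a", "c", "a", "b", "a"),
  stringsAsFactors = FALSE
)

result <- sequential_filter(toy, c("a", "a", "b", "c"))
result
```

On the toy data the first value keeps all three IDs, the second drops ID 3 (no two consecutive "a"), and the third drops ID 2, so only the rows of ID 1 are returned.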

3 Solutions

#1


2  

A possible approach with data.table (an adaptation from my answer here):

# load packages
library(data.table)

# create a function which calculates match-score with 'df2$Value'
maxscore <- function(x, y) {
  m <- mapply('==', shift(x, type = 'lead', n = 0:(length(y) - 1)), y)
  max(rowSums(m, na.rm = TRUE))
}

# calculate the match-score for each group
# and filter out the other groups
setDT(df1)[, score := maxscore(Value, df2$Value), by = ID
           ][score == max(score)][, score := NULL][]

which gives:

   ID   Value
1:  1  apples
2:  1  apples
3:  1 bananas
4:  1  grapes
5:  1 mangoes
6:  1 oranges
7:  1  grapes
8:  1  apples
9:  1  grapes
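
To see what the score measures, here is a base-R sketch that lists the score of every window instead of only the maximum. The helper `window_scores` is hypothetical and not part of the answer; it is meant to mirror the row sums that `maxscore` takes the maximum of, without needing data.table.

```r
# For each start position i in x, count how many values of y match
# position by position in the window x[i], x[i + 1], ...
# Positions past the end of x are NA and are ignored via na.rm = TRUE.
window_scores <- function(x, y) {
  vapply(seq_along(x),
         function(i) sum(x[i:(i + length(y) - 1)] == y, na.rm = TRUE),
         integer(1))
}

x <- c("a", "b", "a", "a", "b", "c")
y <- c("a", "a", "b")

scores <- window_scores(x, y)
scores       # 1 1 3 1 0 0
max(scores)  # 3: the window starting at position 3 matches y completely
```

Because the maximum is taken over all windows, a group scores highly wherever the pattern occurs in it, not only when it occurs at the start.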

You can use that function in a dplyr-chain as well (but you will still need the data.table-package for the shift-function):

library(dplyr)
df1 %>% 
  group_by(ID) %>% 
  mutate(m = maxscore(Value, df2$Value)) %>% 
  ungroup() %>% 
  filter(m == max(m)) %>% 
  select(-m)

An alternative implementation of the maxscore-function (inspired by @doscendo's answer here):

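maxscore2 <- function(x, y) {
  # candidate start positions: wherever the first value of y occurs in x
  w <- which(x == y[1])
  # without any candidate, sapply() returns a list and max() fails, so return 0
  if (length(w) == 0) return(0L)
  # count position-wise matches in the window starting at each candidate
  v <- vapply(w, function(i) sum(x[i:(i+(length(y)-1))] == y, na.rm = TRUE), integer(1))
  max(v)
}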

#2


0  

We can merge the Value entries for each ID using a token separator (say #) and then write a custom function that counts how many tokens match position by position. Finally, select the data for the ID that has the maximum match count.

library(dplyr)

# This function counts position-wise matching tokens separated by `#`
# matched_count("a#b#c", "a#e#c#d") will return 2 (positions 1 and 3 match)
matched_count <- function(x, y){
  x_v <- strsplit(x, split = "#")[[1]]
  y_v <- strsplit(y, split = "#")[[1]]
  max_len <- max(length(x_v), length(y_v))
  length(x_v) <- max_len
  length(y_v) <- max_len
  sum(x_v==y_v,na.rm = TRUE)
}    


Dataframe1 %>% group_by(ID) %>%
  mutate(CompStr = paste0(Value, collapse="#")) %>% #collapse values for ID
  mutate(CompStrdf2 = paste0(Dataframe2$Value, collapse="#")) %>% 
  mutate(max_match = matched_count(CompStr, CompStrdf2)) %>%
  ungroup() %>%
  filter(max_match == max(max_match)) %>%
  select(ID, Value)

# ID Value  
# <int> <chr>  
# 1     1 apples 
# 2     1 apples 
# 3     1 bananas
# 4     1 grapes 
# 5     1 mangoes
# 6     1 oranges
# 7     1 grapes 
# 8     1 apples 
# 9     1 grapes 
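
A small check of how `matched_count` behaves may help here. The function is repeated only so the snippet runs on its own, and the input strings are made up for illustration. Note that the comparison is aligned at the first token of both strings, so a sequence that matches with an offset is not counted.

```r
# Same function as above, repeated so that this snippet is self-contained
matched_count <- function(x, y){
  x_v <- strsplit(x, split = "#")[[1]]
  y_v <- strsplit(y, split = "#")[[1]]
  max_len <- max(length(x_v), length(y_v))
  length(x_v) <- max_len   # pad the shorter vector with NA
  length(y_v) <- max_len
  sum(x_v == y_v, na.rm = TRUE)
}

aligned <- matched_count("a#b#c", "a#e#c#d")  # positions 1 and 3 match
shifted <- matched_count("x#a#b#c", "a#b#c")  # same run, shifted by one token

aligned  # 2
shifted  # 0
```

If the pattern can start anywhere inside a group, a sliding-window score would be needed instead of this position-by-position count.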

#3


0  

I suggest turning the values in each group into a single string and comparing string edit distances. adist computes the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal, possibly weighted, number of insertions, deletions and substitutions needed to transform one string into another.

string_edit_dist <- function(vec1, vec2) {
    c(adist(paste0(vec1, collapse=""), paste0(vec2, collapse="")))
}    

# iterate over the actual ID values so this also works when the IDs are not 1..n
ids <- unique(df1$ID)
dists <- sapply(ids, function(i) string_edit_dist(df1$Value[df1$ID==i], df2$Value))
df1[df1$ID==ids[which.min(dists)], ]

#   ID   Value
# 1  1  apples
# 2  1  apples
# 3  1 bananas
# 4  1  grapes
# 5  1 mangoes
# 6  1 oranges
# 7  1  grapes
# 8  1  apples
# 9  1  grapes

Here is the string_edit_dist result for each group:

sapply(seq_along(unique(df1$ID)), function(i) string_edit_dist(df1$Value[df1$ID==i], df2$Value))
# 7 35 45 46 27
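
As a quick illustration of what `adist` measures, here is a minimal sketch on made-up strings. It also shows one thing to keep in mind with this approach: collapsing with `collapse=""` removes the boundaries between values, so the distance is counted in characters rather than in whole values.

```r
# adist() is in the utils package, which is attached by default
d_words <- c(adist("kitten", "sitting"))  # 2 substitutions + 1 insertion
d_words  # 3

# Two different value sequences that collapse to the same string "abc"
v1 <- c("ab", "c")
v2 <- c("a", "bc")
d_collapsed <- c(adist(paste0(v1, collapse = ""), paste0(v2, collapse = "")))
d_collapsed  # 0, although the sequences differ value by value
```

With fruit names such collisions are unlikely, but for short or overlapping values a separator or a token-level comparison may be safer.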
