I want to extract the strings in the favor column of target that match entries in a dictionary. Here is my data:
dictionary <- c("apple", "banana", "orange", "grape")
target <- data.frame("user" = c("A", "B", "C"),
"favor" = c("I like apple and banana", "grape and kiwi", "orange, banana and grape are the best"))
target
  user                                 favor
1    A               I like apple and banana
2    B                        grape and kiwi
3    C orange, banana and grape are the best
Below is my expected result. I want to automatically create as many columns as the largest number of dictionary matches found in any row (3 in my case), and fill them with the strings that matched the dictionary.
result <- data.frame("user" = c("A", "B", "C"),
"favor_1" = c("apple", "grape", "orange"),
"favor_2" = c("banana", "", "banana"),
"favor_3" = c("", "", "grape"))
result
  user favor_1 favor_2 favor_3
1    A   apple  banana
2    B   grape
3    C  orange  banana   grape
Any help would be appreciated.
2 Answers
#1
1
Your best bet is probably to apply str_extract_all to each row.
library(stringr)
result <- t(apply(target, 1,
function(x) str_extract_all(x[['favor']], dictionary, simplify = TRUE)))
     [,1]    [,2]     [,3]     [,4]
[1,] "apple" "banana" ""       ""
[2,] ""      ""       ""       "grape"
[3,] ""      "banana" "orange" "grape"
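The matrix above has one column per dictionary entry, in dictionary order, which is not quite the compact layout the question asks for. One possible variation is to collapse the dictionary into a single alternation pattern and call str_extract_all once on the whole column. This is only a sketch and not the answerer's code: the names pattern and matches are introduced here, and it assumes the dictionary entries contain no regex metacharacters.

```r
library(stringr)

dictionary <- c("apple", "banana", "orange", "grape")
target <- data.frame(
  user  = c("A", "B", "C"),
  favor = c("I like apple and banana", "grape and kiwi",
            "orange, banana and grape are the best"),
  stringsAsFactors = FALSE
)

# One regex that matches any dictionary entry: "apple|banana|orange|grape"
pattern <- paste(dictionary, collapse = "|")

# simplify = TRUE returns a character matrix with one row per input string,
# padded on the right with "" where a row has fewer matches
matches <- str_extract_all(target$favor, pattern, simplify = TRUE)
colnames(matches) <- paste0("favor_", seq_len(ncol(matches)))

# Attach the user column; matches keep the order in which they appear in the text
result <- cbind(target["user"], matches, stringsAsFactors = FALSE)
result
```

Because this matches substrings, a word such as "pineapple" would also yield "apple". If whole-word matching is wanted, wrapping each entry in \\b word boundaries before collapsing is one way to handle that.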
#2
3
# Remove all words from `target$favor` that are not in the dictionary
result <- lapply(strsplit(target$favor, ',| '), function(x) { x[x %in% dictionary] })
result
# [[1]]
# [1] "apple" "banana"
#
# [[2]]
# [1] "grape"
#
# [[3]]
# [1] "orange" "banana" "grape"
# Fill in NAs when the rows have different numbers of items
result <- lapply(result, `length<-`, max(lengths(result)))
# Rebuild the data.frame using the list of words in each row
cbind(target[ , 'user', drop = FALSE], do.call(rbind, result))
# user 1 2 3
# 1 A apple banana <NA>
# 2 B grape <NA> <NA>
# 3 C orange banana grape
Note that I created target with stringsAsFactors = FALSE so that strsplit can work: strsplit needs a character vector, not a factor. (Since R 4.0.0, stringsAsFactors = FALSE is the default for data.frame.)
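To get from that output to the exact frame in the question, the columns still need favor_N names and the NA padding has to become empty strings. A minimal base-R sketch of those last steps follows. The names words, padded and m are introduced here for illustration, and the splitting logic is the same as in the answer.

```r
dictionary <- c("apple", "banana", "orange", "grape")
target <- data.frame(
  user  = c("A", "B", "C"),
  favor = c("I like apple and banana", "grape and kiwi",
            "orange, banana and grape are the best"),
  stringsAsFactors = FALSE
)

# Split on commas or spaces and keep only the words found in the dictionary
words <- lapply(strsplit(target$favor, ",| "), function(x) x[x %in% dictionary])

# Pad every row to the longest match count; `length<-` fills with NA
padded <- lapply(words, `length<-`, max(lengths(words)))

# Stack into a matrix, replace NA with "", and name the columns favor_1, favor_2, ...
m <- do.call(rbind, padded)
m[is.na(m)] <- ""
colnames(m) <- paste0("favor_", seq_len(ncol(m)))

result <- cbind(target["user"], m, stringsAsFactors = FALSE)
result
```

Unlike a regex search, the %in% test only keeps exact whole-word matches, so "apples" or "Apple" would be dropped unless the text is normalised first.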