My data contains text messages like the ones below. I want to extract the block age from each of them.
x <- c(
  "my block is 8 years old and I am happy with it. I had been travelling since 2 years and that’s fun too…..",
  "He invested in my 1 year block and is happy with the returns",
  "He re-invested in my 1.5 year old block",
  "i had come to U.K for 4 years and when I reach Germany my block will be of 5 years"
)
I extracted the number followed by the word "year" or "years", but I realised I should be picking the number closest to the word "block".
library(stringr)
> str_extract_all(x, "[0-9.]{1,3}.year|[0-9.]{1,3}.years")
[[1]]
[1] "8 years" "2 years"
[[2]]
[1] "1 year"
[[3]]
[1] "1.5 year"
[[4]]
[1] "4 years" "5 years"
I want the output to be a list containing:
8 years
1 year
1.5 year
5 years
I was thinking of extracting the part of each sentence that contains the words "block" and "old", but I am not clear on how to implement this. Any ideas or suggestions to improve this process would be helpful.
Thanks
3 Answers
#1
3
Here's a solution that keeps using stringr:
library(stringr)
m1 <- str_match(x, "block.*?([0-9.]{1,3}.year[s]?)")
m2 <- str_match(x, "([0-9.]{1,3}.year[s]?).*?block")
sapply(seq_along(x), function(i) {
if (is.na(m1[i, 1])) m2[i, 2]
else if (is.na(m2[i, 1])) m1[i, 2]
else if (str_length(m1[i, 1]) < str_length(m2[i, 1])) m1[i, 2]
else m2[i, 2]
})
## [1] "8 years" "1 year" "1.5 year" "5 years"
Or equivalently:
m1 <- str_match(x, "block.*?([0-9.]{1,3}.year[s]?)")
m2 <- str_match(x, "([0-9.]{1,3}.year[s]?).*?block")
cbind(m1[,2], m2[,2])[cbind(seq_len(nrow(m1)), apply(str_length(cbind(m1[,1], m2[,1])), 1, which.min))]
Both solutions assume that "block" appears in each string exactly once.
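For readers without stringr, the same two-sided match can be sketched in base R. This is only an illustration under the same assumption that "block" appears once per message: the helper `match2` and the variable names are hypothetical, and the messages are shortened, ASCII-only copies of the ones in the question. Unlike the version above, it returns NA when no age can be paired with "block".

```r
# Shortened, ASCII-only copies of the question's messages (assumption for this sketch).
x <- c(
  "my block is 8 years old and I am happy with it. I had been travelling since 2 years",
  "He invested in my 1 year block and is happy with the returns",
  "He re-invested in my 1.5 year old block",
  "i had come to U.K for 4 years and when I reach Germany my block will be of 5 years"
)

age <- "([0-9.]{1,3}.years?)"

# Hypothetical helper: one row per message, holding the full match and the
# captured age, or NA in both columns when the pattern does not match.
match2 <- function(pattern, text) {
  hits <- regmatches(text, regexec(pattern, text, perl = TRUE))
  t(vapply(hits,
           function(h) if (length(h) == 2) h else c(NA_character_, NA_character_),
           character(2)))
}

after  <- match2(paste0("block.*?", age), x)  # age to the right of "block"
before <- match2(paste0(age, ".*?block"), x)  # age to the left of "block"

# Use the right-hand age when it is the only match or its overall match is shorter.
use_after <- !is.na(after[, 1]) &
  (is.na(before[, 1]) | nchar(after[, 1]) < nchar(before[, 1]))
result <- ifelse(use_after, after[, 2], before[, 2])
print(result)
```

As in the stringr version, the length of the whole match stands in for the distance between "block" and the age, so it is a heuristic rather than an exact character distance.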
#2
0
One idea is to get the positions of the "block" words and of the ages, then for each block find the nearest age. I am using gregexpr to compute the positions.
## assumption: txt holds all messages collapsed into one string,
## so that match positions are comparable across messages
txt <- paste(x, collapse = " ")
## positions of "block"
d_block <- unlist(gregexpr('block', txt))
## positions of ages
## Note that I am using ? to simplify your regex
d_age <- unlist(gregexpr("[0-9.]{1,3}.years?", txt))
## for each block, get the nearest age position
nearest <- sapply(d_block, function(x) d_age[which.min(abs(x - d_age))])
## get the age values
all_ages <- unlist(regmatches(txt, gregexpr("[0-9.]{1,3}.years?", txt)))
## keep only the ages nearest to a block
all_ages[d_age %in% nearest]
## [1] "8 years" "1 year" "1.5 year" "5 years"
#3
0
This approach finds the "year" or "years" word at the shortest distance from "block", then removes all the other "year" or "years" words from each message before running your str_extract_all line.
## split each message into words once
words <- strsplit(x, " ")
## index of the "year" word nearest to "block" (NULL if there is only one "year" word)
goodyear <- lapply(words, function(w) {
  yr <- grep("year", w)
  if (length(yr) > 1) yr[which.min(abs(grep("block", w) - yr))]
})
for (i in seq_along(x)) {
  w <- words[[i]]
  ## drop every "year" word except the nearest one
  if (!is.null(goodyear[[i]])) w <- w[-setdiff(grep("year", w), goodyear[[i]])]
  ## years? matches both "year" and "years"
  print(str_extract_all(paste(w, collapse = " "), "[0-9.]{1,3}.years?"))
}
## [[1]]
## [1] "8 years"
##
## [[1]]
## [1] "1 year"
##
## [[1]]
## [1] "1.5 year"
##
## [[1]]
## [1] "5 years"
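The position-based idea from answer #2 can also be applied one message at a time, which keeps x as a character vector and returns one age per message. The following is a minimal base R sketch, not the answerers' code: the function name `nearest_block_age` is hypothetical, the messages are shortened, ASCII-only copies of the question's, and it assumes at most one "block" per message (only the first occurrence is used).

```r
# Hypothetical helper: the age whose start position is closest to "block".
nearest_block_age <- function(msg, age_pattern = "[0-9.]{1,3}.years?") {
  age_match <- gregexpr(age_pattern, msg)
  age_pos <- as.integer(age_match[[1]])
  block_pos <- as.integer(regexpr("block", msg, fixed = TRUE))
  # No age or no "block" in this message: nothing to pair up.
  if (age_pos[1] == -1 || block_pos == -1) return(NA_character_)
  ages <- regmatches(msg, age_match)[[1]]
  ages[which.min(abs(age_pos - block_pos))]
}

# Shortened, ASCII-only copies of the question's messages (assumption for this sketch).
x <- c(
  "my block is 8 years old and I am happy with it. I had been travelling since 2 years",
  "He invested in my 1 year block and is happy with the returns",
  "He re-invested in my 1.5 year old block",
  "i had come to U.K for 4 years and when I reach Germany my block will be of 5 years"
)

result <- vapply(x, nearest_block_age, character(1), USE.NAMES = FALSE)
print(result)
```

Distances are measured between start positions in characters, so a long age string to the left of "block" is treated as slightly farther away than its end actually is.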