I'm looking to use R to clean up some text strings from a database. The database stores the text complete with HTML tags. Unfortunately, due to database limitations, each string is broken into multiple fragments in the database. I think I could figure out how to remove the HTML tags with regular expressions and the help of other posts, but I don't expect those solutions to work unless I first concatenate the fragments back together (opening and closing HTML tags can be spread across records in the data frame). Here is some sample data:
Existing dataframe
Record_nbr fragment Comments
1 1 "The quick brown"
1 2 "fox jumped over"
1 3 "the lazy dog."
2 1 "New Record."
Desired output dataframe
Record_nbr fragment Comments
1 3 "The quick brown fox jumped over the lazy dog."
2 1 "New Record."
Data:
dat <- read.table(text='Record_nbr fragment Comments
1 1 "The quick brown"
1 2 "fox jumped over"
1 3 "the lazy dog."
2 1 "New Record."', header=TRUE)
5 solutions
#1
0
It seems like the fragment column becomes unusable after the split? Maybe:
aggregate(dat[3], dat[1], paste, collapse = " ")
#   Record_nbr                                      Comments
# 1          1 The quick brown fox jumped over the lazy dog.
# 2          2                                   New Record.
equivalent to
aggregate(Comments ~ Record_nbr, data = dat, paste, collapse = " ")
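A note on the collapse argument in calls like these: as far as I can tell, paste() without collapse hands back one string per fragment, so the aggregated column ends up holding a vector per record rather than a single sentence. Here is a minimal sketch using the question's sample data (the names no_collapse and collapsed are only for illustration):

```r
# Same sample data as in the question
dat <- read.table(text = 'Record_nbr fragment Comments
1 1 "The quick brown"
1 2 "fox jumped over"
1 3 "the lazy dog."
2 1 "New Record."', header = TRUE, stringsAsFactors = FALSE)

# Without collapse, paste() returns its input element by element,
# so each record still carries one string per fragment
no_collapse <- aggregate(Comments ~ Record_nbr, data = dat, FUN = paste)
lengths(no_collapse$Comments)   # 3 strings for record 1, 1 for record 2

# With collapse = " ", each record is reduced to a single string
collapsed <- aggregate(Comments ~ Record_nbr, data = dat, FUN = paste, collapse = " ")
collapsed$Comments
```

Extra arguments after FUN are passed through to FUN, which is why collapse can be given directly in the aggregate() call.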
#2
1
I am assuming that you didn't actually want to keep the fragment column. In this case you can use this quick one-liner:
aggregate(Comments ~ Record_nbr, data=dat, function(x) paste(x, collapse=" "))
#3
0
Here's one of many approaches:
## ensure order
dat <- with(dat, dat[order(Record_nbr, fragment), ])

do.call(rbind, lapply(split(dat, dat$Record_nbr), function(x) {
    data.frame(
        x[1, 1, drop=FALSE],
        fragment = max(x[, 2]),
        Comments = paste(x$Comments, collapse=" ")
    )
}))
##   Record_nbr fragment                                      Comments
## 1          1        3 The quick brown fox jumped over the lazy dog.
## 2          2        1                                   New Record.
#4
0
Using dplyr:
library(dplyr)
dat %>%
    group_by(Record_nbr) %>%
    summarize(fragment = n(), Comments = paste(Comments, collapse = " "))
# Record_nbr fragment Comments
#1 1 3 The quick brown fox jumped over the lazy dog.
#2 2 1 New Record.
#5
0
Also consider using the quicker 'aggregate' function:
aggregate(dat, by=list(dat$Record_nbr), paste, collapse=" ")
## Group.1 Record_nbr fragment Comments
## 1 1 1 1 1 1 2 3 The quick brown fox jumped over the lazy dog.
## 2 2 2 1 New Record.
Edit: You might have to play with the function inputs to get the exact outcome you want.
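To tie the answers back to the original goal (rejoin the fragments, then strip the HTML), here is a rough sketch in base R. The data frame html_dat and its contents are made up for illustration, and the regex is a simple tag stripper rather than a real HTML parser, so treat it as a starting point under those assumptions:

```r
# Hypothetical data: tags and words are cut across fragments,
# and the rows are deliberately out of order
html_dat <- data.frame(
  Record_nbr = c(1, 1, 1, 2),
  fragment   = c(2, 1, 3, 1),
  Comments   = c("own</b> fox jum", "The <b>quick br", "ped over.", "<p>New Record.</p>"),
  stringsAsFactors = FALSE
)

# 1. Sort so fragments are pasted in their original sequence
html_dat <- html_dat[order(html_dat$Record_nbr, html_dat$fragment), ]

# 2. Rejoin the fragments. collapse = "" assumes the database cut the text
#    at arbitrary positions; use " " if it cut at whitespace, as the
#    question's sample data suggests
joined <- aggregate(Comments ~ Record_nbr, data = html_dat, FUN = paste, collapse = "")

# 3. Remove anything that looks like a tag: "<", then non-">" characters, then ">"
joined$Comments <- gsub("<[^>]+>", "", joined$Comments)
joined
```

The choice of collapse matters here: joining with a space when a tag was split mid-name would leave something like "< /b>" or "<b r>" behind, which the tag regex may then handle differently than intended.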