I have a data frame called "stemmoutput" (see below):
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 tanaman cabai
2 banget hama sakit tanaman
3 koramil nogosari melaks ecek hama tanaman padi ppl ds rambun
And I want to merge the values of multiple columns into one column, like this:
TEXT
1 tanaman cabai
2 banget hama sakit tanaman
3 koramil nogosari melaks ecek hama tanaman padi ppl ds rambun
I have tried this code, and it works:
stemmoutput$TEXT <- with(stemmoutput, paste(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10, sep=" "))
But is there a more efficient way that does not require writing out the column names one by one?
I've also tried the code below, but it didn't work:
for(i in names(stemmoutput)){
stemmoutput$TEXT <- with(stemmoutput, paste(i, sep=" "))}
2 Solutions
#1
2
Try do.call:
library(stringr)
newdat <- data.frame(TEXT=str_trim(do.call(paste, stemmoutput)),
stringsAsFactors=FALSE)
newdat
# TEXT
#1 tanaman cabai
#2 banget hama sakit tanaman
#3 koramil nogosari melaks ecek hama tanaman padi ppl ds rambun
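For reference, here is a minimal self-contained sketch of the same idea using base R only. The sample data frame is an assumption reconstructed from the question (four columns instead of ten, with empty strings in the unused cells), and trimws() stands in for str_trim():

```r
# Hypothetical sample data mimicking the question: unused cells are empty strings
stemmoutput <- data.frame(
  X1 = c("tanaman", "banget", "koramil"),
  X2 = c("cabai", "hama", "nogosari"),
  X3 = c("", "sakit", "melaks"),
  X4 = c("", "tanaman", "ecek"),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so do.call() hands every column to paste()
# as a separate argument; sep is simply appended to that argument list
merged <- do.call(paste, c(stemmoutput, sep = " "))

# Empty trailing cells leave trailing spaces behind, so trim them
stemmoutput$TEXT <- trimws(merged)
stemmoutput$TEXT
```

Note that this only trims the ends of each string; an empty cell between two non-empty ones would still leave a double space in the middle.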
It may be better to use ", " as the delimiter if there are multi-part words within a column:
TEXT <- gsub(', [^A-Za-z]+', '', do.call(paste, c(stemmoutput, sep=', ')))
newdat <- data.frame(TEXT, stringsAsFactors=FALSE)
newdat
# TEXT
#1 tanaman, cabai
#2 banget, hama, sakit, tanaman
#3 koramil, nogosari, melaks, ecek, hama, tanaman, padi, ppl, ds, rambun
#2
1
Here's another idea using tidyr.
If you want to unite only columns X1 to X10, you could do:
library(tidyr)
unite(stemmoutput, TEXT, num_range("X", 1:10), sep = " ")
If you want to unite all columns, do:
unite(stemmoutput, TEXT, everything(), sep = " ")
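As a rough illustration of what unite() returns, here is a small sketch on made-up data. The three-column sample data frame and the trimws() clean-up step are assumptions added for the example, not part of the answer above:

```r
library(tidyr)

# Hypothetical sample data: unused cells are empty strings
stemmoutput <- data.frame(
  X1 = c("tanaman", "banget", "koramil"),
  X2 = c("cabai", "hama", "nogosari"),
  X3 = c("", "sakit", "melaks"),
  stringsAsFactors = FALSE
)

# unite() pastes the selected columns into TEXT and, by default
# (remove = TRUE), drops the input columns from the result
united <- unite(stemmoutput, TEXT, everything(), sep = " ")

# Empty trailing cells leave trailing spaces behind, so trim them
united$TEXT <- trimws(united$TEXT)
united
```

Passing remove = FALSE to unite() keeps X1 to X3 alongside the new TEXT column, which may be closer to what the original paste() call in the question produced.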
Benchmarks
I benchmarked the two approaches because I suspected unite would be much faster than do.call, but they ended up fairly close (the median time for unite was about 18% lower):
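library(microbenchmark)
library(stringr)
library(tidyr)
# Test data: 10 columns of 1e6 rows, each column repeating one shuffled 10-letter string
df <- data.frame(replicate(10,sample(paste0(
sample(LETTERS[1:10]), collapse = ""), 10e5, replace = TRUE)))
# Time both approaches 50 times each
mbm <- microbenchmark(
akrun = data.frame(TEXT=str_trim(do.call(paste, df)), stringsAsFactors=FALSE),
steven = unite(df, TEXT, everything(), sep = " "),
times = 50
)
mbm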
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# akrun 1117.1350 1132.3861 1146.3943 1136.3094 1145.076 1232.5633 50 b
# steven 910.7432 924.0386 927.8614 927.7224 929.649 995.3584 50 a