Joining factor levels of two columns in R

Time: 2021-12-24 07:36:45

I have 2 columns of data with the same type of data (Strings).

I want to join the levels of the columns, i.e. we have:

col1   col2
Bob    John
Tom    Bob
Frank  Jane
Jim    Bob
Tom    Bob
...    ... (and so on)

Now col1 has 4 levels (Bob, Tom, Frank, Jim) and col2 has 3 levels (John, Jane, Bob).

But I want both columns to have all the factor levels (Bob, Tom, Frank, Jim, Jane, John), so that I can later replace each of the 'names' with a unique id. The final output would be:

col1   col2
1      5
2      1
3      6
4      1
2      1

That is, Bob -> 1, Tom -> 2, etc. in both columns.

Any ideas :) ?

edit: Thanks all for the wonderful answers! You are all awesome as far as I know :)

3 solutions

#1


6  

You want the factors to include all the unique names from both columns.

col1 <- factor(c("Bob", "Tom", "Frank", "Jim", "Tom"))
col2 <- factor(c("John", "Bob", "Jane", "Bob", "Bob"))
mynames <- unique(c(levels(col1), levels(col2)))
fcol1 <- factor(col1, levels = mynames)
fcol2 <- factor(col2, levels = mynames)

EDIT: a little nicer if you replace the third line with this:

mynames <- union(levels(col1), levels(col2))
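
The same idea extends to any number of columns. Here is a minimal sketch, assuming the data sit in a data frame; `unify_levels` is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: give every listed column the same set of factor levels.
unify_levels <- function(df, cols = names(df)) {
  # Collect the levels of each column, then keep each level once.
  all_levels <- unique(unlist(
    lapply(df[cols], function(v) levels(factor(v))),
    use.names = FALSE
  ))
  # Re-create each column as a factor over the shared levels.
  df[cols] <- lapply(df[cols], factor, levels = all_levels)
  df
}

df <- data.frame(
  col1 = c("Bob", "Tom", "Frank", "Jim", "Tom"),
  col2 = c("John", "Bob", "Jane", "Bob", "Bob"),
  stringsAsFactors = TRUE
)
res <- unify_levels(df)

# Both columns now share one set of six levels.
identical(levels(res$col1), levels(res$col2))
```

The values themselves are unchanged; only the level set attached to each factor grows.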

#2


11  

x <- structure(list(col1 = structure(c(1L, 4L, 2L, 3L, 4L), .Label = c("Bob", "Frank", "Jim", "Tom"), class = "factor"), col2 = structure(c(3L, 1L, 2L, 1L, 1L), .Label = c("Bob", "Jane", "John"), class = "factor")), .Names = c("col1", "col2"), class = "data.frame", row.names = c(NA, -5L))

Make a simple union of factor names:

both <- union(levels(x$col1), levels(x$col2))

And relevel the two factors:

x$col1 <- factor(x$col1, levels=both)
x$col2 <- factor(x$col2, levels=both)

After editing: added example to make numeric values from factors

You could simply transform the factor levels to numeric values, e.g.:

as.numeric(x$col1)

Or a simpler, nicer one-step solution based on @Gavin Simpson's hint below:

data.matrix(x)
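
Note that the numeric ids follow the order of the levels. Taking the union of the existing levels keeps each column's alphabetical level order, so Tom would not get id 2. If the exact mapping from the question matters (Bob -> 1, Tom -> 2, ...), one option is to order the levels by first appearance instead. A small sketch under that assumption, rebuilding `x` so it runs on its own:

```r
x <- data.frame(
  col1 = c("Bob", "Tom", "Frank", "Jim", "Tom"),
  col2 = c("John", "Bob", "Jane", "Bob", "Bob"),
  stringsAsFactors = TRUE
)

# Levels in order of first appearance, reading col1 and then col2.
lev <- unique(c(as.character(x$col1), as.character(x$col2)))

x$col1 <- factor(x$col1, levels = lev)
x$col2 <- factor(x$col2, levels = lev)

# Integer codes: col1 is 1 2 3 4 2 and col2 is 5 1 6 1 1.
ids <- data.matrix(x)
ids
```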

#3


2  

Could have sworn this didn't work when I was writing the abomination below, but it does now:

## self contained example:
txt <- "col1   col2
Bob    John
Tom    Bob
Frank  Jane
Jim    Bob
Tom    Bob"
dat <- read.table(textConnection(txt), header = TRUE)

Just compute the unique set of levels and coerce each colX to a factor:

> dat3 <- dat
> lev <- unique(unlist(lapply(dat, as.character)))  # levels in order of first appearance
> dat3 <- within(dat3, col1 <- factor(col1, levels = lev))
> dat3 <- within(dat3, col2 <- factor(col2, levels = lev))
> str(dat3)
'data.frame':   5 obs. of  2 variables:
 $ col1: Factor w/ 6 levels "Bob","Tom","Frank",..: 1 2 3 4 2
 $ col2: Factor w/ 6 levels "Bob","Tom","Frank",..: 5 1 6 1 1
> data.matrix(dat3)
     col1 col2
[1,]    1    5
[2,]    2    1
[3,]    3    6
[4,]    4    1
[5,]    2    1
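
If only the ids are needed, the factor step can be skipped by looking each name up in the shared level vector with `match()`. This is a sketch of an alternative, not part of the answer above; `key` is an illustrative name for a lookup table:

```r
dat <- data.frame(
  col1 = c("Bob", "Tom", "Frank", "Jim", "Tom"),
  col2 = c("John", "Bob", "Jane", "Bob", "Bob"),
  stringsAsFactors = FALSE
)

# Shared names in order of first appearance.
lev <- unique(unlist(lapply(dat, as.character), use.names = FALSE))

# Replace every name by its position in lev.
ids <- as.data.frame(lapply(dat, function(v) match(as.character(v), lev)))

# Lookup table for translating ids back to names.
key <- data.frame(id = seq_along(lev), name = lev)
```

Keeping `key` around makes it possible to map the ids back to the original names later.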

[Original: to show how stupidly complex and obfuscated one can write R code if one tries really hard!] Not sure this is particularly elegant (and it isn't), but...

We first unlist the data:

tmp <- unlist(dat)

then compute the unique levels

lev <- as.character(unique(tmp))

and then restructure tmp (from above) back into the same dimensions as the original data, convert to data.frame (preserving the strings), lapply over this data frame, creating a factor with levels lev computed above, and finally coerce to a data frame.

dat2 <- data.frame(lapply(data.frame(matrix(tmp, ncol = ncol(dat)), 
                                     stringsAsFactors = FALSE), 
                          FUN = factor, levels = lev))

Which gives:

> dat2
     X1   X2
1   Bob John
2   Tom  Bob
3 Frank Jane
4   Jim  Bob
5   Tom  Bob
> sapply(dat2, levels)
     X1      X2     
[1,] "Bob"   "Bob"  
[2,] "Tom"   "Tom"  
[3,] "Frank" "Frank"
[4,] "Jim"   "Jim"  
[5,] "John"  "John" 
[6,] "Jane"  "Jane" 
> data.matrix(dat2)
     X1 X2
[1,]  1  5
[2,]  2  1
[3,]  3  6
[4,]  4  1
[5,]  2  1
