Understanding coercion of factors to characters in an R data frame

Date: 2022-02-17 13:02:03

I am trying to figure out how coercion of factors and data frames works in R. I want to draw boxplots for a subset of a data frame. Let's go step by step.

x = rnorm(30, 1, 1)

This creates a vector x of 30 draws from a normal distribution with mean 1 and standard deviation 1.

c = c(rep("x1",10), rep("x2",10), rep("x3",10))

This creates a character vector to use later as the grouping factor for boxplots of x1, x2, and x3.

df = data.frame(x,c)

This combines x and c into a data frame. So now we would expect class(df) to be "data.frame", class(df$x) to be "numeric", and class(df$c) to be "factor" (because we put c into a data frame), and both is.data.frame(df) and is.list(df) should return TRUE. (I assume that all data frames are also lists, and that is why both checks return TRUE.)

And that's what happens below. All good so far.

class(df)
#[1] "data.frame"
is.data.frame(df)
#[1] TRUE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"

Now I plot the spread of x grouped by the levels of c, so the first argument is x ~ c. But I want boxplots for just two levels, x1 and x2, so I used the subset argument of the boxplot function.

boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)

This is the plot we get. Notice that because c is a factor, its level x3 is still plotted: we still get 3 categories on the x-axis of the boxplot in spite of subsetting to 2 categories.

So, one solution I found was to change the class of the df variables to numeric and character:

class(df)<- c("numeric", "character")

boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)

New boxplot. This is what we wanted, so it worked! We plotted boxes for just x1 and x2 and got rid of x3.

But if we rerun the same checks that we ran before this coercion, on all variables, we get these outputs.

Anything funny?

class(df)
#[1] "numeric"   "character"
is.data.frame(df)
#[1] FALSE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"

Notice that df$c (the second variable, containing the categories x1, x2, x3) is still a factor!

And df stopped being a list (so was it ever a list?)

And what exactly did class(df) <- c("numeric", "character") do, if it did not change the data type of df$c?

So to sum up, my questions (tl;dr version):

  • Are all data frames also lists in R?

  • Why did our boxplot drop x3 in the second case (when we coerced class(df) into numeric and character)?

  • If we did coerce the factor into characters by doing the above steps, why does it still show that the variable's class is factor?

  • And why did df stop being a data frame after we did the above steps?

2 Answers

#1

The answers make more sense if we take your questions in a different order.

Are all data frames also lists in R?

Yes. A data frame is a list of equal-length vectors (the columns) that carries the class "data.frame".
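
The list nature of a data frame can be checked directly. The following is a minimal sketch using only base R; the small data frame and the name col_classes are made up for illustration, and stringsAsFactors = TRUE is spelled out as an assumption so that the character column becomes a factor regardless of R version.

```r
# A tiny data frame with one numeric and one factor column.
df <- data.frame(
  x = c(0.5, 1.5, 2.5),
  c = c("x1", "x2", "x3"),
  stringsAsFactors = TRUE
)

# The underlying storage type is a list, even though the class is "data.frame".
storage_type <- typeof(df)
is_list <- is.list(df)

# Each list element is one column, so length() counts columns, not rows.
n_columns <- length(df)

# List tools such as vapply() therefore iterate over the columns.
col_classes <- vapply(df, function(col) class(col)[1], character(1))
print(col_classes)
```

Because the columns are list elements, df[["c"]] and df$c extract a column in the same way they would extract an element from any named list.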

And why did df stop being a list after we did the above steps?

It didn't. It stopped being a data frame, because you changed its class with class(df) <- c("numeric", "character"). is.list(df) still returns TRUE.

If we did coerce the factor into characters by doing the above steps, why does it still show that the variable's class is factor?

class(df) operates on the df object itself, not on its columns. Look at str(df): the factor column is still a factor. The assignment class(df) <- c("numeric", "character") only set the class attribute of the data frame object itself to that character vector.
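
To make the distinction concrete, here is a minimal sketch in base R. The object names broken and fixed are hypothetical, and stringsAsFactors = TRUE is an assumption added so the example behaves the same on any R version.

```r
df <- data.frame(
  x = c(1.5, 2.5, 3.5),
  c = c("x1", "x2", "x3"),
  stringsAsFactors = TRUE
)

# Overwrite the class attribute of the whole object, as in the question.
broken <- df
class(broken) <- c("numeric", "character")

still_data_frame <- is.data.frame(broken)  # the "data.frame" class is gone
still_list <- is.list(broken)              # the underlying list is untouched
column_class <- class(broken$c)            # the column itself was never changed

# To convert the column, assign to the column rather than to the object.
fixed <- df
fixed$c <- as.character(fixed$c)
fixed_column_class <- class(fixed$c)
fixed_is_data_frame <- is.data.frame(fixed)
```

The first assignment touches only attr(broken, "class"); the second replaces the contents of one column and leaves the data frame intact.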

Why did our boxplot drop x3 in the second case (when we coerced class(df) into numeric and character)?

You've messed up your data frame object by explicitly setting its class attribute to the vector c("numeric", "character"). It's hard to predict the full effects of this. My best guess is that boxplot, or the functions that draw the axes, accessed the class attribute of the data frame somehow.
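
Independently of that guess, one behaviour can be checked directly in base R: subsetting a factor keeps all of its levels, while a character vector carries no level information at all. That is consistent with the empty x3 category in the first plot. A minimal sketch, with variable names made up for illustration:

```r
groups <- rep(c("x1", "x2", "x3"), each = 10)
grp_factor <- factor(groups)
keep <- groups %in% c("x1", "x2")

# Subsetting a factor drops the x3 values but keeps x3 as a level.
kept_factor <- grp_factor[keep]
factor_levels <- levels(kept_factor)
factor_counts <- table(kept_factor)   # x3 appears with a count of zero

# Subsetting the character vector first and converting afterwards
# yields a factor that never knew about x3.
kept_char <- groups[keep]
char_levels <- levels(factor(kept_char))

print(factor_counts)
```

Any function that builds one group per level will therefore see three groups in the first case and two in the second.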

To do what you really wanted:

x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c)
df$c <- as.character(df$c)

or

x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c, stringsAsFactors=FALSE)
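
Both variants should leave df$c as a plain character vector. The sketch below checks that they agree; the names labels, df_factor, df_char, and converted are hypothetical. Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the argument is written out in both calls to keep the example independent of the R version.

```r
labels <- rep(c("x1", "x2", "x3"), each = 10)

# Variant 1: build with a factor column, then convert the column.
df_factor <- data.frame(x = seq_len(30), c = labels, stringsAsFactors = TRUE)
class_before <- class(df_factor$c)
converted <- as.character(df_factor$c)

# Variant 2: never create the factor in the first place.
df_char <- data.frame(x = seq_len(30), c = labels, stringsAsFactors = FALSE)
class_direct <- class(df_char$c)

# as.character() on a factor returns the level labels, not the integer codes.
same_values <- identical(converted, df_char$c)
```

The second variant avoids the round trip through a factor, which is usually the simpler choice when the column is only ever used as text.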

#2

Use droplevels like this:

df0 <- subset(df, c %in% c("x1", "x2"))
df0 <- transform(df0, c = droplevels(c))
levels(df0$c)
## [1] "x1" "x2"

Note that c now has only two levels, not three.

We can write this as a pipeline using magrittr like this:

library(magrittr)

df %>%
   subset(c %in% c("x1", "x2")) %>%
   transform(c = droplevels(c)) %>%
   boxplot(x ~ c, data = .)
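
If an extra package is not wanted, roughly the same result can be sketched in base R, because droplevels() also has a data frame method that drops unused levels from every factor column. In this sketch df is rebuilt with stringsAsFactors = TRUE as an assumption, so that c is a factor on any R version, and plot = FALSE is used only so the group names can be inspected without opening a graphics device.

```r
set.seed(42)
df <- data.frame(
  x = rnorm(30, 1, 1),
  c = rep(c("x1", "x2", "x3"), each = 10),
  stringsAsFactors = TRUE
)

# Subset the rows, then drop the now-unused level x3 in one step.
df0 <- droplevels(subset(df, c %in% c("x1", "x2")))
remaining_levels <- levels(df0$c)

# Compute the boxplot statistics without drawing; remove plot = FALSE to draw.
bp <- boxplot(x ~ c, data = df0, plot = FALSE)
group_names <- bp$names
print(group_names)
```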
