How is it determined whether a column in a data frame is of class factor?

Date: 2021-09-15 16:32:14

On creating a column whose contents contain duplicate values, I notice the following with regard to factors.

1. If a column with duplicate character values is part of a data frame when the data frame is created, it is of class factor; but if the same column is appended later, it is of class character, even though the values are the same in both cases. Why is this?

#creating a data frame
name = c('waugh','waugh','smith')
age = c(21,21,27)
df = data.frame(name,age)

#adding a new column which has the same values as the 'name' column above, to the data frame
df$newcol = c('waugh','waugh','smith')

#you can see that the classes of the two columns differ even though the values are the same
class(df$name)
## [1] "factor"
class(df$newcol)
## [1] "character"
2. Only a column with duplicate alphabetic contents becomes a factor; a column with duplicate numeric values is not treated as a factor. Why is that? The numbers could just as well be codes, for example 1 = Male and 0 = Female, in which case the column should be a factor.

    # note that both of these columns contain duplicate values

    class(df$name)
    ## [1] "factor"
    class(df$age)
    ## [1] "numeric"
    

1 answer

#1

This was basically answered in the comments, but I'll put the answer here to close out the question.

When you use data.frame() to create a data frame, the function does more than bundle the arguments you pass in: it can also convert them. Specifically, it has a parameter named stringsAsFactors, which defaulted to TRUE in R versions before 4.0.0 (the default has been FALSE since R 4.0.0). When it is TRUE, every character vector you pass in is converted to a factor, because such values are usually treated as categorical variables in statistical tests, and a factor can be more efficient to store than a character vector when many values repeat. Numeric vectors are never converted, whether or not their values repeat. The output below assumes the old default.

df <- data.frame(name,age)
class(df$name)
# [1] "factor"
df <- data.frame(name,age, stringsAsFactors=FALSE)
class(df$name)
# [1] "character"
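
Because the default differs between R versions, it can help to pass stringsAsFactors explicitly and inspect every column at once. The following is a minimal sketch, not part of the original answer; the names df_fac, df_chr, classes_fac and classes_chr are made up for illustration.

```r
name <- c("waugh", "waugh", "smith")
age <- c(21, 21, 27)

# Pass stringsAsFactors explicitly so the result does not depend on the R version.
df_fac <- data.frame(name, age, stringsAsFactors = TRUE)
df_chr <- data.frame(name, age, stringsAsFactors = FALSE)

# sapply() applies class() to each column and returns a named character vector.
classes_fac <- sapply(df_fac, class)
classes_chr <- sapply(df_chr, class)

print(classes_fac)  # name is "factor", age is "numeric"
print(classes_chr)  # name is "character", age is "numeric"
```

In both cases the age column stays numeric: only character vectors are affected by stringsAsFactors.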

Note that the data.frame itself doesn't remember the "stringsAsFactors" value used during its construction; it is only consulted while data.frame() runs. So if you add a column later by assigning it with the $<- syntax, no coercion happens, whichever setting the data frame was created with. cbind() is different: its data frame method is a wrapper around data.frame(), so a character vector bound this way follows the data.frame() default.

df1 <- data.frame(name,age)
df2 <- data.frame(name,age, stringsAsFactors=FALSE)
df1$name2 <- name
df2$name2 <- name
df3 <- cbind(data.frame(name,age), name2=name)
class(df1$name2)
# [1] "character"
class(df2$name2)
# [1] "character"
class(df3$name2)
# [1] "factor"      (before R 4.0.0; "character" from R 4.0.0 on)

If you want to add the column as a factor, you need to convert it yourself with factor():

df = data.frame(name,age)
df$name2 <- factor(name)
class(df$name2)
# [1] "factor"
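
The same explicit conversion applies to the numeric case raised in the question. data.frame() cannot tell whether numbers are measurements or category codes, so a coded column has to be turned into a factor by hand. The following is a minimal sketch, not part of the original answer; the vector sex_code and the 0 = Female, 1 = Male coding are hypothetical, taken from the question's example.

```r
name <- c("waugh", "waugh", "smith")
age <- c(21, 21, 27)
df <- data.frame(name, age)

# Hypothetical 0/1 coding from the question: 1 = Male, 0 = Female.
sex_code <- c(1, 0, 1)

# levels lists the codes that occur; labels gives each code a readable name,
# matched by position (0 -> "Female", 1 -> "Male").
df$sex <- factor(sex_code, levels = c(0, 1), labels = c("Female", "Male"))

class(df$sex)
# [1] "factor"
levels(df$sex)
# [1] "Female" "Male"
as.character(df$sex)
# [1] "Male"   "Female" "Male"
```

Giving levels and labels explicitly keeps the mapping between codes and names visible, instead of relying on the sort order of the codes.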
