I was trying out linear regression in R with categorical attributes and observed that I don't get a coefficient for each of the factor levels I have.
Please see my code below. I have 5 factor levels for states, but I see only 4 coefficient values.
> states = c("WA","TE","GE","LA","SF")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
states population
1 WA 0.5
2 TE 0.2
3 GE 0.6
4 LA 0.7
5 SF 0.9
> states=NULL
> population=NULL
> lm(formula=population~states,data=df)
Call:
lm(formula = population ~ states, data = df)
Coefficients:
(Intercept) statesLA statesSF statesTE statesWA
0.6 0.1 0.3 -0.4 -0.1
I also tried a larger data set by doing the following, but still see the same behavior:
for(i in 1:10)
{
df = rbind(df,df)
}
EDIT: Thanks for the responses from eipi10, MrFlick and economy. I now understand that one of the levels is being used as the reference level. But when I get new test data whose state is "GE", what do I substitute into the equation y = m1x1 + m2x2 + ... + c?
I also tried flattening the data so that each factor level gets its own column, but again I get NA as the coefficient for one of the columns. If I have new test data whose state is 'WA', how can I get the predicted population value? What do I substitute as its coefficient?
> df1
  population GE MI TE WA
1          1  0  0  0  1
2          2  1  0  0  0
3          2  0  0  1  0
4          1  0  1  0  0
lm(formula = population ~ (GE+MI+TE+WA),data=df1)
Call:
lm(formula = population ~ (GE + MI + TE + WA), data = df1)
Coefficients:
(Intercept) GE MI TE WA
1 1 0 1 NA
1 Answer
GE comes first alphabetically, so R uses it as the reference level and folds it into the intercept instead of giving it its own coefficient. As eipi10 stated, you can interpret the coefficients for the other levels in states with GE as the baseline (statesLA = 0.1 means LA is, on average, 0.1 higher than GE).
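To make the baseline visible, here is a minimal sketch that rebuilds the data frame from the question (an assumed reconstruction; the names fit, fit_wa and states_wa are only illustrative) and then switches the reference level with relevel():

```r
# Assumed reconstruction of the question's data; states is stored as a factor.
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

# Factor levels are sorted alphabetically, so "GE" is first and becomes
# the reference level under R's default treatment contrasts.
levels(df$states)

fit <- lm(population ~ states, data = df)
coef(fit)  # intercept = GE's value; other coefficients are offsets from GE

# relevel() chooses a different baseline; "WA" is an arbitrary choice here.
df$states_wa <- relevel(df$states, ref = "WA")
fit_wa <- lm(population ~ states_wa, data = df)
coef(fit_wa)  # intercept is now WA's value; offsets are relative to WA
```

Both fits describe the same model; only the level that the intercept stands for changes.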
EDIT:
To respond to your updated question:
If you include all of the levels in a linear regression alongside an intercept, you get a situation called perfect collinearity, which is responsible for the strange results (the NA coefficient) you see when you force each category into its own variable. The dummy columns for all levels always sum to 1, which is exactly the intercept column, so the model cannot tell them apart and R drops one term. If you want to see a coefficient for every level, you can fit the regression without an intercept term, as suggested in the comments, but this is ill-advised unless you have a specific reason to do so.
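As a rough illustration of the collinearity, the sketch below (again using an assumed reconstruction of the question's data; X and fit_no_int are illustrative names) builds one 0/1 column per state, checks that the columns add up to the intercept column, and then fits the no-intercept model mentioned in the comments:

```r
# Assumed reconstruction of the question's data.
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

# Dropping the intercept ("- 1") makes model.matrix() emit a dummy
# column for every level, including GE.
X <- model.matrix(~ states - 1, data = df)

# Every row has exactly one 1, so the dummies sum to a column of ones,
# which duplicates the intercept column.
rowSums(X)

# Intercept plus all five dummies: 6 columns but only rank 5.
qr(cbind(1, X))$rank

# Without an intercept, each state gets its own coefficient, which is
# simply that state's mean population.
fit_no_int <- lm(population ~ states - 1, data = df)
coef(fit_no_int)
```

The no-intercept fit gives the same predictions as the default one; the coefficients just stop being differences from a baseline, which also changes what the usual summary statistics (such as R-squared) mean.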
As for the interpretation of GE in your y = mx + c equation, you can calculate the expected y by noting that the dummy variables for the other states are binary (zero or one), and if the state is GE, they are all zero.
e.g.
y = x1b1 + x2b2 + x3b3 + c
y = b1(0) + b2(0) + b3(0) + c
y = c
If you don't have any other variables, as in your first example, the predicted value for GE is simply the intercept term (0.6).
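In practice you rarely need to substitute values by hand, because predict() builds the dummy variables for new data itself. A minimal sketch, assuming the data frame from the question is rebuilt as below (new_obs, manual_ge and manual_wa are illustrative names):

```r
# Assumed reconstruction of the question's data.
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)
fit <- lm(population ~ states, data = df)

# New observations only need the predictor column; predict() maps the
# state names onto the factor levels used when fitting.
new_obs <- data.frame(states = c("GE", "WA"))
pred <- predict(fit, newdata = new_obs)
pred

# The same numbers by hand: GE is the baseline, so all dummies are 0
# and only the intercept remains; WA adds its own offset.
b <- coef(fit)
manual_ge <- b[["(Intercept)"]]
manual_wa <- b[["(Intercept)"]] + b[["statesWA"]]
c(GE = manual_ge, WA = manual_wa)
```

For this toy data the predictions are 0.6 for GE and 0.5 for WA, matching the values in the original data frame.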