I have been asked to see if there is a linear trend in 3 groups of data (5 points each) by using ANOVA and linear contrasts. The 3 groups represent data collected in 2010
, 2011
and 2012
. I want to use R for this procedure and I have tried both of the following:
我被要求用方差分析和线性对比来观察三组数据(每组5分)是否存在线性趋势。这三组数据分别来自2010年、2011年和2012年的数据。我想在这个过程中使用R,我试过以下两种方法:
contrasts(data$groups, how.many=1) <- contr.poly(3)
contrasts(data$groups) <- contr.poly(3)
Both ways seem to work fine but give slightly different answers in terms of their p-values. I have no idea which is correct and it is really tricky to find help for this on the web. I would like help figuring out what is the reasoning behind the different answers. I'm not sure if it has something to do with partitioning sums of squares or whatnot.
这两种方法似乎都很有效,但在p值方面给出的答案略有不同。我不知道哪个是正确的,在网上找帮助真的很棘手。我想帮助大家找出不同答案背后的原因。我不确定它是否和划分平方和有关。
2 个解决方案
#1
3
Both approaches differ with respect to whether a quadratic polynomial is used.
对于是否使用二次多项式,这两种方法都不同。
For illustration purposes, have a look at this example, both x
and y
are a factor with three levels.
为了便于说明,请看这个例子,x和y都是具有三个层次的因子。
x <- y <- gl(3, 2)
# [1] 1 1 2 2 3 3
# Levels: 1 2 3
The first approach creates a contrast matrix for a quadratic polynomial, i.e., with a linear (.L
) and a quadratic trend (.Q
). The 3
means: Create the 3 - 1
th polynomial.
第一种方法为二次多项式创建一个对比矩阵,即。,有线性趋势(. l)和二次趋势(. q)。3的意思是:创建3 - 1多项式。
contrasts(x) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
# .L .Q
# 1 -7.071068e-01 0.4082483
# 2 -7.850462e-17 -0.8164966
# 3 7.071068e-01 0.4082483
# Levels: 1 2 3
In contrast, the second approach results in a polynomial of first order (i.e., a linear trend only). This is due to the argument how.many = 1
. Hence, only 1
contrast is created.
相比之下,第二种方法会得到一个一阶多项式(即这是一个线性趋势。这是由于争论的原因。许多= 1。因此,只创建了一个对比。
contrasts(y, how.many = 1) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
# .L
# 1 -7.071068e-01
# 2 -7.850462e-17
# 3 7.071068e-01
# Levels: 1 2 3
If you're interested in the linear trend only, the second option seems more appropriate for you.
如果你只对线性趋势感兴趣,第二种选择似乎更适合你。
#2
1
Changing the contrasts you ask for changes the degrees of freedom of the model. If one model requests linear and quadratic contrasts, and a second specifies only, say, the linear contrast, then the second model has an extra degree of freedom: this will increase the power to test the linear hypothesis, (at the cost of preventing the model fitting the quadratic trend).
改变你要求的对比改变模型的*度。如果一个模型要求线性和二次对比,而第二个模型只指定线性对比,那么第二个模型有一个额外的*度:这将增加测试线性假设的能力(以防止模型符合二次趋势为代价)。
Using the full ("nlevels - 1") set of contrasts creates an orthogonal set of contrasts which explore the full set of (independent) response configurations. Cutting back to just one prevents the model from fitting one configuration (in this case the quadratic component which our data in fact possess.
使用完整的(“nlevels - 1”)对比集创建一组正交的对比集,以探索完整的(独立的)响应配置集。减少到仅仅一个可以防止模型拟合一个配置(在这种情况下,我们的数据实际上拥有的二次组件)。
To see how this works, use the built-in dataset mtcars
, and test the (confounded) relationship of gears to gallons. We'll hypothesize that the more gears the better (at least up to some point).
要了解这是如何工作的,请使用内置的数据集mtcars,并测试(令人困惑的)齿轮与加仑的关系。我们假设齿轮越多越好(至少在某种程度上)。
df = mtcars # copy the dataset
df$gear = as.ordered(df$gear) # make an ordered factor
Ordered factors default to polynomial contrasts, but we'll set them here to be explicit:
有序因子默认为多项式对比,但这里我们将它们设置为显式:
contrasts(df$gear) <- contr.poly(nlevels(df$gear))
Then we can model the relationship.
然后我们可以建立关系模型。
m1 = lm(mpg ~ gear, data = df);
summary.lm(m1)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 20.6733 0.9284 22.267 < 2e-16 ***
# gear.L 3.7288 1.7191 2.169 0.03842 *
# gear.Q -4.7275 1.4888 -3.175 0.00353 **
#
# Multiple R-squared: 0.4292, Adjusted R-squared: 0.3898
# F-statistic: 10.9 on 2 and 29 DF, p-value: 0.0002948
Note we have F(2,29) = 10.9 for the overall model and p=.038 for our linear effect with an estimated extra 3.7 mpg/gear.
注意,整个模型有F(2,29) = 10.9, p=。038为我们的线性效应估计额外3.7 mpg/gear。
Now let's only request the linear contrast, and run the "same" analysis.
现在我们只请求线性对比,并运行“相同”分析。
contrasts(df$gear, how.many = 1) <- contr.poly(nlevels(df$gear))
m1 = lm(mpg ~ gear, data = df)
summary.lm(m1)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 21.317 1.034 20.612 <2e-16 ***
# gear.L 5.548 1.850 2.999 0.0054 **
# Multiple R-squared: 0.2307, Adjusted R-squared: 0.205
# F-statistic: 8.995 on 1 and 30 DF, p-value: 0.005401
The linear effect of gear is now bigger (5.5 mpg) and p << .05 - A win? Except the overall model fit is now significantly worse: variance accounted for is now just 23% (was 43%)! Why is clear if we plot the relationship:
现在齿轮的线性效应更大(5.5 mpg), p < 0.05 - win?除了整体模型拟合现在明显更糟:方差现在只占23% (43%)!如果我们画出这种关系,为什么是清晰的?
plot(mpg ~ gear, data = df) # view the relationship
So, if you're interested in the linear trend, but also expect (or are unclear about) additional levels of complexity, you should also test these higher polynomials. The quadratic (or, in general, trends up to levels-1).
所以,如果你对线性趋势感兴趣,但也期望(或不清楚)复杂度增加,你也应该测试这些更高的多项式。二次方程(或者,一般来说,趋向于1级)。
Note too that in this example the physical mechanism is confounded: We've forgotten that number of gears is confounded with automatic vs manual transmission, and also with weight, and sedan vs sports car.
还请注意,在这个例子中,物理机制是混乱的:我们已经忘记了齿轮的数量与自动vs手动变速箱,以及重量,轿车vs跑车。
If someone wants to test the hypothesis that 4 gears is better than 3, they could answer this question :-)
如果有人想验证4档比3档更好的假设,他们可以回答这个问题:-)
#1
3
Both approaches differ with respect to whether a quadratic polynomial is used.
对于是否使用二次多项式,这两种方法都不同。
For illustration purposes, have a look at this example, both x
and y
are a factor with three levels.
为了便于说明,请看这个例子,x和y都是具有三个层次的因子。
x <- y <- gl(3, 2)
# [1] 1 1 2 2 3 3
# Levels: 1 2 3
The first approach creates a contrast matrix for a quadratic polynomial, i.e., with a linear (.L
) and a quadratic trend (.Q
). The 3
means: Create the 3 - 1
th polynomial.
第一种方法为二次多项式创建一个对比矩阵,即。,有线性趋势(. l)和二次趋势(. q)。3的意思是:创建3 - 1多项式。
contrasts(x) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
# .L .Q
# 1 -7.071068e-01 0.4082483
# 2 -7.850462e-17 -0.8164966
# 3 7.071068e-01 0.4082483
# Levels: 1 2 3
In contrast, the second approach results in a polynomial of first order (i.e., a linear trend only). This is due to the argument how.many = 1
. Hence, only 1
contrast is created.
相比之下,第二种方法会得到一个一阶多项式(即这是一个线性趋势。这是由于争论的原因。许多= 1。因此,只创建了一个对比。
contrasts(y, how.many = 1) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
# .L
# 1 -7.071068e-01
# 2 -7.850462e-17
# 3 7.071068e-01
# Levels: 1 2 3
If you're interested in the linear trend only, the second option seems more appropriate for you.
如果你只对线性趋势感兴趣,第二种选择似乎更适合你。
#2
1
Changing the contrasts you ask for changes the degrees of freedom of the model. If one model requests linear and quadratic contrasts, and a second specifies only, say, the linear contrast, then the second model has an extra degree of freedom: this will increase the power to test the linear hypothesis, (at the cost of preventing the model fitting the quadratic trend).
改变你要求的对比改变模型的*度。如果一个模型要求线性和二次对比,而第二个模型只指定线性对比,那么第二个模型有一个额外的*度:这将增加测试线性假设的能力(以防止模型符合二次趋势为代价)。
Using the full ("nlevels - 1") set of contrasts creates an orthogonal set of contrasts which explore the full set of (independent) response configurations. Cutting back to just one prevents the model from fitting one configuration (in this case the quadratic component which our data in fact possess.
使用完整的(“nlevels - 1”)对比集创建一组正交的对比集,以探索完整的(独立的)响应配置集。减少到仅仅一个可以防止模型拟合一个配置(在这种情况下,我们的数据实际上拥有的二次组件)。
To see how this works, use the built-in dataset mtcars
, and test the (confounded) relationship of gears to gallons. We'll hypothesize that the more gears the better (at least up to some point).
要了解这是如何工作的,请使用内置的数据集mtcars,并测试(令人困惑的)齿轮与加仑的关系。我们假设齿轮越多越好(至少在某种程度上)。
df = mtcars # copy the dataset
df$gear = as.ordered(df$gear) # make an ordered factor
Ordered factors default to polynomial contrasts, but we'll set them here to be explicit:
有序因子默认为多项式对比,但这里我们将它们设置为显式:
contrasts(df$gear) <- contr.poly(nlevels(df$gear))
Then we can model the relationship.
然后我们可以建立关系模型。
m1 = lm(mpg ~ gear, data = df);
summary.lm(m1)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 20.6733 0.9284 22.267 < 2e-16 ***
# gear.L 3.7288 1.7191 2.169 0.03842 *
# gear.Q -4.7275 1.4888 -3.175 0.00353 **
#
# Multiple R-squared: 0.4292, Adjusted R-squared: 0.3898
# F-statistic: 10.9 on 2 and 29 DF, p-value: 0.0002948
Note we have F(2,29) = 10.9 for the overall model and p=.038 for our linear effect with an estimated extra 3.7 mpg/gear.
注意,整个模型有F(2,29) = 10.9, p=。038为我们的线性效应估计额外3.7 mpg/gear。
Now let's only request the linear contrast, and run the "same" analysis.
现在我们只请求线性对比,并运行“相同”分析。
contrasts(df$gear, how.many = 1) <- contr.poly(nlevels(df$gear))
m1 = lm(mpg ~ gear, data = df)
summary.lm(m1)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 21.317 1.034 20.612 <2e-16 ***
# gear.L 5.548 1.850 2.999 0.0054 **
# Multiple R-squared: 0.2307, Adjusted R-squared: 0.205
# F-statistic: 8.995 on 1 and 30 DF, p-value: 0.005401
The linear effect of gear is now bigger (5.5 mpg) and p << .05 - A win? Except the overall model fit is now significantly worse: variance accounted for is now just 23% (was 43%)! Why is clear if we plot the relationship:
现在齿轮的线性效应更大(5.5 mpg), p < 0.05 - win?除了整体模型拟合现在明显更糟:方差现在只占23% (43%)!如果我们画出这种关系,为什么是清晰的?
plot(mpg ~ gear, data = df) # view the relationship
So, if you're interested in the linear trend, but also expect (or are unclear about) additional levels of complexity, you should also test these higher polynomials. The quadratic (or, in general, trends up to levels-1).
所以,如果你对线性趋势感兴趣,但也期望(或不清楚)复杂度增加,你也应该测试这些更高的多项式。二次方程(或者,一般来说,趋向于1级)。
Note too that in this example the physical mechanism is confounded: We've forgotten that number of gears is confounded with automatic vs manual transmission, and also with weight, and sedan vs sports car.
还请注意,在这个例子中,物理机制是混乱的:我们已经忘记了齿轮的数量与自动vs手动变速箱,以及重量,轿车vs跑车。
If someone wants to test the hypothesis that 4 gears is better than 3, they could answer this question :-)
如果有人想验证4档比3档更好的假设,他们可以回答这个问题:-)