Delete rows of a contingency table based on cell values

Time: 2022-06-19 22:23:41

I have a data frame with approximately 20,000 observations. From this I've created a contingency table with frequencies of two variables.

With this I want to perform a chi-squared test of independence to see if there is a relationship between my two variables. Ordinarily this is easy but many cells have expected values of 0, despite the large size of the original data frame. I want to delete any rows that contain a frequency less than 5.

I've searched Stack Exchange extensively, but I can't find a solution to this specific problem that either a) I understand (I'm relatively new to R), or b) works with a contingency table rather than the original data frame.

Any help greatly appreciated.

Edit:

Thanks for your response Justin.

As requested, I've uploaded extracts of the dataframe and contingency table. I've also uploaded the small amount of code I've tried so far, with results.

Dataframe

Department Super
AAP     1
ACS     4
ACE     1
AMA     1
APS     3
APS     2
APS     1
APS     1
ARC     5
ARC     7
ARC     1
BIB     6
BIB     6
BMS     2

So there are two columns: the first is a three-letter department code and the second is a one-digit integer (1-7).

Contingency Table

table(department,super)

        1    2   3   4   5   6   7   8
ACS     32  10   7  24  50   7  24  14
AMA      0   4   2   6  10   3  11   1
...

So a standard contingency table with frequencies.

So far I know I can create a logical test that checks whether each cell's count is greater than 5:

depSupCrosstab <- depSupCrosstab[,2:8]>5

What I don't know is how to use the matrix that this line of code creates to drop whole rows if they have any FALSE entries.

Hope that helps. I'm afraid I'm new at this, but there's only one way to learn...

2 solutions

#1


0  

I think I've found the answer in a related question. apply is your friend in this case, as it can iterate over columns or rows.

To create a data frame analogous to yours and then select only the rows in which every column is greater than 5, one can use the following:

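set.seed(1985)
# 10 x 8 data frame of random integers between 0 and 100
tosub <- data.frame(matrix(round(runif(n = 80, min = 0, max = 100)), ncol = 8))
head(tosub, 2)
# TRUE for each row in which all eight values exceed 5
x <- apply(tosub[, 1:8] > 5, MARGIN = 1, all)
summary(x)
# keep only those rows
tosub[x, ]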

   X1 X2 X3 X4 X5 X6 X7 X8
1  66 30 72 59 26 69 76 47
2  27 42 26 95 66 14 67 18
4  42 28 93  7 35 35 95 23
5  38 89 69 91 98 91 60 69
9  89 31 91 72 28 31 58 58
10 53 87 27 89 95 37 98 20
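
The same idea should carry over to a table object, since a two-way table can be indexed like a matrix. The following is only a minimal sketch: the counts are copied from the two rows shown in the question, and the names counts, keep and filtered are made up for illustration.

```r
# Counts copied from the two rows shown in the question's table
counts <- rbind(
  ACS = c(32, 10, 7, 24, 50, 7, 24, 14),
  AMA = c(0, 4, 2, 6, 10, 3, 11, 1)
)
colnames(counts) <- 1:8
tab <- as.table(counts)

# TRUE for rows in which every observed count exceeds 5
keep <- apply(tab > 5, MARGIN = 1, FUN = all)

# drop = FALSE keeps the two-dimensional shape even if only one row survives
filtered <- tab[keep, , drop = FALSE]
filtered
```

Here only ACS survives, because AMA contains counts of 5 or less. An equivalent test without apply would be rowSums(tab <= 5) == 0.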

#2


1  

I am afraid that your problem is more complex. The usual rule of thumb for the chi-square test is that the expected frequency of each cell should be at least 5. In your example you are filtering on the count in each cell of the contingency table, which is the observed frequency. The expected frequency (under the null hypothesis) is calculated from the row and column totals: for each cell it is the row total times the column total, divided by the grand total, as shown in the basic example here.

To follow your example, a hypothetical contingency table may look like:

ACS <- c(32, 10, 7, 24, 50, 7, 24, 14)
AMA <- c(0, 4, 2, 6, 10, 3, 11, 1)
ARC <- c(6, 10, 12, 3, 12, 23, 10, 2)

tab <- rbind(ACS, AMA, ARC)

If you screen for observed counts less than or equal to 5, you would remove AMA and ARC:

apply(tab,1, function(x) any(x<=5))

  ACS   AMA   ARC 
FALSE  TRUE  TRUE 

This is conceptually wrong because, as mentioned above, the expected frequencies depend on the whole table. To obtain the expected counts:

chisq.test(tab, correct = FALSE)$expected

         [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
ACS 22.558304 14.247350 12.466431 19.590106 42.742049 19.590106 26.713781
AMA  4.968198  3.137809  2.745583  4.314488  9.413428  4.314488  5.883392
ARC 10.473498  6.614841  5.787986  9.095406 19.844523  9.095406 12.402827
         [,8]
ACS 10.091873
AMA  2.222615
ARC  4.685512

Warning message:
In chisq.test(tab, correct = FALSE): Chi-squared approximation may be incorrect
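
To make the dependence on the margins explicit, the expected counts can also be computed by hand as row total times column total divided by the grand total. This is a sketch only; expected_manual and expected_r are illustrative names, and the table is rebuilt so the snippet runs on its own.

```r
ACS <- c(32, 10, 7, 24, 50, 7, 24, 14)
AMA <- c(0, 4, 2, 6, 10, 3, 11, 1)
ARC <- c(6, 10, 12, 3, 12, 23, 10, 2)
tab <- rbind(ACS, AMA, ARC)

# E[i, j] = (row total i) * (column total j) / (grand total)
expected_manual <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# chisq.test() warns for this table, so the warning is silenced here
expected_r <- suppressWarnings(chisq.test(tab, correct = FALSE)$expected)

# Both routes should agree up to floating-point error
all.equal(unname(expected_manual), unname(expected_r))
```

For instance, the AMA row total is 37, the first column total is 38 and the grand total is 283, so the first AMA cell is 37 * 38 / 283, about 4.968, which matches the output above.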

The chi-square test issues a warning because some cells do indeed have expected counts below 5. But if you remove only AMA, the row and column totals of the table change, and all of the expected counts are above 5:

chisq.test(tab[-2, ], correct = FALSE)$expected

        [,1]      [,2]     [,3]      [,4]     [,5]      [,6]     [,7]
ACS 25.95122 13.658537 12.97561 18.439024 42.34146 20.487805 23.21951
ARC 12.04878  6.341463  6.02439  8.560976 19.65854  9.512195 10.78049
         [,8]
ACS 10.926829
ARC  5.073171
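
One way to see this effect is to list the rows that contain at least one low expected count, before and after dropping a row. The helper low_expected_rows below is hypothetical, written only for this sketch, with the threshold of 5 as its default.

```r
ACS <- c(32, 10, 7, 24, 50, 7, 24, 14)
AMA <- c(0, 4, 2, 6, 10, 3, 11, 1)
ARC <- c(6, 10, 12, 3, 12, 23, 10, 2)
tab <- rbind(ACS, AMA, ARC)

# Hypothetical helper: names of rows with any expected count below the threshold
low_expected_rows <- function(m, threshold = 5) {
  e <- suppressWarnings(chisq.test(m, correct = FALSE)$expected)
  rownames(m)[apply(e < threshold, 1, any)]
}

low_expected_rows(tab)         # full table: AMA and ARC are flagged
low_expected_rows(tab[-2, ])   # after dropping AMA: nothing is flagged
```

ARC is flagged in the full table only because of its last cell (about 4.69), and that cell rises above 5 once AMA is gone.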

So if you removed both AMA and ARC, you would lose important information.


You may try to run Fisher's exact test (see the explanation below):

fisher.test(tab,simulate.p.value=TRUE,B=10000)
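
Because the simulated p-value depends on random draws, it may help to rerun the test with several values of B and compare. A rough sketch follows; the seed and the particular values of B are arbitrary choices, not recommendations.

```r
ACS <- c(32, 10, 7, 24, 50, 7, 24, 14)
AMA <- c(0, 4, 2, 6, 10, 3, 11, 1)
ARC <- c(6, 10, 12, 3, 12, 23, 10, 2)
tab <- rbind(ACS, AMA, ARC)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
reps <- c(2000, 10000, 50000)

# Monte Carlo p-value of Fisher's exact test for each number of replicates
p_values <- sapply(reps, function(b) {
  fisher.test(tab, simulate.p.value = TRUE, B = b)$p.value
})
names(p_values) <- reps
p_values
```

If the p-values differ noticeably between runs, a larger B is probably needed before drawing a conclusion.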

To conclude:

  1. The individual observed frequencies are a poor indicator of the expected frequencies. An observed frequency can be below 5 while the expected frequency for that cell is above 5.
  2. In large contingency tables, it is acceptable to have up to 20% of expected frequencies below 5, but the result is a loss of statistical power, so the test may fail to detect a genuine effect. Even in that case, no expected frequency should be below 1.
  3. An alternative test for low expected frequencies is Fisher's exact test. The chi-square test statistic only approximately follows a chi-square distribution. The approximation becomes more accurate as the sample size grows, hence the requirement that expected frequencies be at least 5. Fisher's exact test computes an exact p-value even when the sample size is small, but it can be more computationally intensive. Unfortunately, for contingency tables larger than 2x2 you may need to simulate the p-value, which has its own limitations (no space to discuss them here, but it's a good research subject). Choose a large number of replicates for the simulation (B), and vary it to see how robust your result is.
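
The 20% and minimum-of-1 guidelines from the second point can be checked directly on the expected counts. A minimal sketch, with check_expected as a made-up helper name and the example table from above rebuilt so the snippet runs on its own:

```r
ACS <- c(32, 10, 7, 24, 50, 7, 24, 14)
AMA <- c(0, 4, 2, 6, 10, 3, 11, 1)
ARC <- c(6, 10, 12, 3, 12, 23, 10, 2)
tab <- rbind(ACS, AMA, ARC)

# Hypothetical helper: share of expected counts below 5, and the smallest one
check_expected <- function(m) {
  e <- suppressWarnings(chisq.test(m)$expected)
  c(share_below_5 = mean(e < 5), min_expected = min(e))
}

check_expected(tab)        # 7 of 24 cells (about 29%) are below 5
check_expected(tab[-2, ])  # no cell is below 5 once AMA is dropped
```

On the full table the share exceeds 20%, which is consistent with the warning from chisq.test, while the smallest expected count (about 2.22) is still above 1.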
