I'm trying to remove "singletons" from a binary matrix. Here, singletons refers to elements that are the only "1" value in the row AND the column in which they appear. For example, given the following matrix:
> matrix(c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1), nrow=6)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    1    0    0    0    0    0
[2,]    1    0    1    0    0    0    0
[3,]    0    0    0    1    0    0    0
[4,]    1    1    0    0    0    0    0
[5,]    0    0    0    0    1    1    1
[6,]    0    0    0    0    1    0    1
...I would like to remove all of row 3 (and, if possible, all of column 4), because the 1 in [3,4] is the only 1 in that row/column combination. [1,2] is fine, since there are other 1's in column [,2]; similarly, [2,3] is fine, since there are other 1's in row [2,]. Any help would be appreciated - thanks!
2 solutions
#1
3
You first want to find which rows and columns are singletons (contain exactly one 1), and then check whether any singleton row and singleton column meet at the same 1. Here is a short bit of code to accomplish this task:
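foo <- matrix(c(0,1,0,...))
# rows and columns that contain exactly one 1
singRows <- which(rowSums(foo) == 1)
singCols <- which(colSums(foo) == 1)
# every (singleton row, singleton column) combination
singCombinations <- expand.grid(singRows, singCols)
# keep the combinations where the row's only 1 sits in that column
singPairs <- singCombinations[apply(singCombinations, 1,
    function(x) which(foo[x[1],] == 1) == x[2]),]
# setdiff() keeps everything when no pair is found
# (a negative empty index would drop all rows and columns)
noSingFoo <- foo[setdiff(seq_len(nrow(foo)), singPairs[,1]),
                 setdiff(seq_len(ncol(foo)), singPairs[,2]), drop = FALSE]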
With many singleton rows or columns you might need to make this a bit more efficient, but it does the job.
UPDATE: Here is the more efficient version I knew could be done. This way you loop only over the rows (or columns if desired) and not all combinations. Thus it is much more efficient for matrices with many singleton rows/columns.
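## starting with foo and singRows as before
# singleton rows whose single 1 is also the only 1 in its column
singPairRows <- singRows[sapply(singRows, function(singRow)
    sum(foo[,foo[singRow,] == 1]) == 1)]
# one column per pair: row index in row 1, column index in row 2
singPairs <- sapply(singPairRows, function(singRow)
    c(singRow, which(foo[singRow,] == 1)))
# note: this assumes at least one singleton pair was found
noSingFoo <- foo[-singPairs[1,], -singPairs[2,]]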
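As a side note, the same test can be written without an explicit loop, which also makes it easy to cope with a matrix that has no singletons at all. The following is only a minimal sketch: `removeSingletons` is a name made up for illustration, it assumes a matrix of 0s and 1s, and it is not the code that was benchmarked.

```r
# hypothetical helper, not part of the original answer
removeSingletons <- function(m) {
  rs <- rowSums(m)
  cs <- colSums(m)
  # (row, col) position of every 1; column 1 = row index, column 2 = column index
  ones <- which(m == 1, arr.ind = TRUE)
  # a 1 is a singleton when it is alone in both its row and its column
  isSing <- rs[ones[, 1]] == 1 & cs[ones[, 2]] == 1
  sing <- ones[isSing, , drop = FALSE]
  # setdiff() leaves the matrix untouched when sing is empty
  m[setdiff(seq_len(nrow(m)), sing[, 1]),
    setdiff(seq_len(ncol(m)), sing[, 2]), drop = FALSE]
}

foo <- matrix(c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,
                0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1), nrow = 6)
noSingFoo <- removeSingletons(foo)
```

For the matrix in the question this drops row 3 and column 4 and leaves a 5 x 6 matrix.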
UPDATE 2: I have compared the two methods (mine = non-sparse, @Chris's = sparse) using the rbenchmark package, over a range of matrix sizes (from 10 to 1000 rows/columns; square matrices only) and levels of sparsity (from 0.1 to 5 non-zero entries per row/column). The relative performance is shown in the heat map below as the log2 ratio of run times: white means equal performance, red means the sparse method is faster, and blue means the non-sparse method is faster. Note that I am not including the conversion to a sparse matrix in the timing, so that will add some time to the sparse method. I just thought it was worth a little effort to see where this boundary was.
#2
2
cr1msonB1ade's way is a great answer. For much larger matrices (millions x millions), you can use this method:
Encode your matrix in sparse notation:
DT <- structure(list(i = c(1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6), j = c(2,
1, 3, 4, 1, 2, 5, 6, 7, 5, 7), val = c(1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1)), .Names = c("i", "j", "val"), row.names = c(NA, -11L
), class = "data.frame")
This gives (0s are implicit):
> DT
   i j val
1  1 2   1
2  2 1   1
3  2 3   1
4  3 4   1
5  4 1   1
6  4 2   1
7  5 5   1
8  5 6   1
9  5 7   1
10 6 5   1
11 6 7   1
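If you are starting from an ordinary dense matrix like the one in the question rather than from triplets, one way to get this (i, j, val) form is `which(..., arr.ind = TRUE)`. This is just a sketch for matrices small enough to hold in memory; for the millions x millions case you would presumably build the triplets directly from your data source instead.

```r
foo <- matrix(c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,
                0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1), nrow = 6)

# two-column matrix of positions: column 1 = row index, column 2 = column index
idx <- which(foo != 0, arr.ind = TRUE)

# one data frame row per non-zero entry
DT <- data.frame(i = idx[, 1], j = idx[, 2], val = foo[idx])

# which() walks the matrix column by column; sort by row, then column
DT <- DT[order(DT$i, DT$j), ]
rownames(DT) <- NULL
```

The result has the same 11 rows as the table above.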
Then we can filter using:
library(data.table)
DT <- data.table(DT)
# number of 1s in each row and in each column
DT[, rowcount := .N, by = i]
DT[, colcount := .N, by = j]
Giving:
> DT[!(rowcount*colcount == 1)]
    i j val rowcount colcount
 1: 1 2   1        1        2
 2: 2 1   1        2        2
 3: 2 3   1        2        1
 4: 4 1   1        2        2
 5: 4 2   1        2        2
 6: 5 5   1        3        2
 7: 5 6   1        3        1
 8: 5 7   1        3        2
 9: 6 5   1        2        2
10: 6 7   1        2        2
(Note the (3,4) row is now missing)
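If you need an ordinary matrix again after filtering, the surviving triplets can be written back with a two-column index. The sketch below is an assumption-laden illustration rather than part of the method above: it uses base R's `ave()` as a stand-in for the data.table counts so that it runs without extra packages, and the names `kept`, `rows`, `cols` and `dense` are made up here.

```r
DT <- data.frame(i = c(1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6),
                 j = c(2, 1, 3, 4, 1, 2, 5, 6, 7, 5, 7),
                 val = 1)

# base-R stand-in for the rowcount/colcount columns
DT$rowcount <- ave(DT$val, DT$i, FUN = length)
DT$colcount <- ave(DT$val, DT$j, FUN = length)
kept <- DT[!(DT$rowcount * DT$colcount == 1), ]

# original row and column numbers that still hold at least one 1
rows <- sort(unique(kept$i))
cols <- sort(unique(kept$j))

# empty matrix labelled with the original indices
dense <- matrix(0, nrow = length(rows), ncol = length(cols),
                dimnames = list(rows, cols))

# fill each (row, col) position with its value
dense[cbind(match(kept$i, rows), match(kept$j, cols))] <- kept$val
```

Be aware that rows or columns that were entirely zero to begin with never appear in the triplets, so they would be dropped here as well; the example matrix has none.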