How to compare consecutive rows in a matrix column and change values accordingly

Time: 2022-08-04 13:01:43

I have a matrix full of 1's and 0's. The columns represent samples and the rows represent chromosomes.

I would like to keep only the 1's that are consecutive within a column (i.e. at least two consecutive rows containing a 1) and set every isolated 1 to 0. This has to be restricted per chromosome, so that consecutive 1's spanning two chromosomes are not counted.

I would like to do this for each column in the matrix.

My data is as follows:

chr       leftPos     OC_030_ST.res OC_031_WG.res
1           4324            0            1
1           23433           1            1
1           34436           1            0
1           64755           1            1
3           234             1            0
3           354             0            1
4           1666            0            1
4           4565            0            1
5           34777           1            1
7           2345            1            1
7           4567            1            1

and the output should be:

chr       leftPos     OC_030_ST.res OC_031_WG.res
1           4324            0            1
1           23433           1            1
1           34436           1            0
1           64755           1            0
3           234             0            0
3           354             0            0
4           1666            0            1
4           4565            0            1
5           34777           0            0
7           2345            1            1
7           4567            1            1

I don't know how to compare consecutive rows within each chromosome. I imagine I could group by chromosome with dplyr and somehow compare rows, but the comparison itself is a bit beyond me.

EDIT

Actual data, as produced by dput():

    structure(list(chr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), leftPos = c(240000, 
1080000, 1200000, 1320000, 1440000, 1800000, 2400000, 2520000, 
3120000, 3360000, 3480000, 3600000, 3720000, 4200000, 4560000, 
4920000, 5040000, 5160000, 5280000, 6000000, 7080000, 7200000, 
7320000, 7440000, 7560000, 7680000, 7800000, 8280000, 8400000, 
8520000, 8760000, 9120000, 9720000, 9840000, 9960000, 10080000, 
10200000, 10320000, 10440000, 10560000, 10800000, 11040000, 11160000, 
11280000, 11400000, 11520000, 11760000, 11880000, 12000000, 12120000
), chr.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), leftPos.res = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0), OC_AH_026C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
), OC_AH_026C.1.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_026C.2.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_AH_084C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_AH_086C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_086C.1.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_AH_086C.2.res = c(0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0), OC_AH_086C.3.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), OC_AH_088C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_AH_094C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_AH_094C.1.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_094C.2.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_AH_094C.3.res = c(0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0), OC_AH_094C.4.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_094C.5.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_AH_094C.6.res = c(0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0), OC_AH_094C.7.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_096C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_AH_100C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_AH_100C.1.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_127C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_AH_133C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_ED_008C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ED_008C.1.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 1, 0, 0), OC_ED_008C.2.res = c(0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
0, 0), OC_ED_008C.3.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ED_016C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_ED_031C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_ED_036C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_GS_001C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_QE_062C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_RS_010C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_RS_027C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_RS_027C.1.res = c(0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0), OC_RS_027C.2.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_SH_051C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_ST_014C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_ST_014C.1.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ST_020C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_ST_024C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_ST_033C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ST_034C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_ST_034C.1.res = c(0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0), OC_ST_034C.2.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ST_035C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_ST_036C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_ST_040C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_WG_002C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), OC_WG_005C.res = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), OC_WG_006C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_WG_019C.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), Type.res = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), ZSSLX.10457.FastSeqA.BloodDMets_16AF_AHMMH.s_1.r_1.fq.gz.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), ZSSLX.10457.FastSeqB.BloodDMets_13AF_AHMMH.s_1.r_1.fq.gz.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), ZSSLX.10457.FastSeqC.BloodDMets_16AF_AHMMH.s_1.r_1.fq.gz.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 1, 0, 0), ZSSLX.10457.FastSeqD.BloodDMets_27AF_AHMMH.s_1.r_1.fq.gz.res = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 1, 0, 0), Means.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
    sd.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), count = 1:50), .Names = c("chr", 
"leftPos", "chr.res", "leftPos.res", "OC_AH_026C.res", "OC_AH_026C.1.res", 
"OC_AH_026C.2.res", "OC_AH_084C.res", "OC_AH_086C.res", "OC_AH_086C.1.res", 
"OC_AH_086C.2.res", "OC_AH_086C.3.res", "OC_AH_088C.res", "OC_AH_094C.res", 
"OC_AH_094C.1.res", "OC_AH_094C.2.res", "OC_AH_094C.3.res", "OC_AH_094C.4.res", 
"OC_AH_094C.5.res", "OC_AH_094C.6.res", "OC_AH_094C.7.res", "OC_AH_096C.res", 
"OC_AH_100C.res", "OC_AH_100C.1.res", "OC_AH_127C.res", "OC_AH_133C.res", 
"OC_ED_008C.res", "OC_ED_008C.1.res", "OC_ED_008C.2.res", "OC_ED_008C.3.res", 
"OC_ED_016C.res", "OC_ED_031C.res", "OC_ED_036C.res", "OC_GS_001C.res", 
"OC_QE_062C.res", "OC_RS_010C.res", "OC_RS_027C.res", "OC_RS_027C.1.res", 
"OC_RS_027C.2.res", "OC_SH_051C.res", "OC_ST_014C.res", "OC_ST_014C.1.res", 
"OC_ST_020C.res", "OC_ST_024C.res", "OC_ST_033C.res", "OC_ST_034C.res", 
"OC_ST_034C.1.res", "OC_ST_034C.2.res", "OC_ST_035C.res", "OC_ST_036C.res", 
"OC_ST_040C.res", "OC_WG_002C.res", "OC_WG_005C.res", "OC_WG_006C.res", 
"OC_WG_019C.res", "Type.res", "ZSSLX.10457.FastSeqA.BloodDMets_16AF_AHMMH.s_1.r_1.fq.gz.res", 
"ZSSLX.10457.FastSeqB.BloodDMets_13AF_AHMMH.s_1.r_1.fq.gz.res", 
"ZSSLX.10457.FastSeqC.BloodDMets_16AF_AHMMH.s_1.r_1.fq.gz.res", 
"ZSSLX.10457.FastSeqD.BloodDMets_27AF_AHMMH.s_1.r_1.fq.gz.res", 
"Means.res", "sd.res", "count"), row.names = c(NA, 50L), class = "data.frame") 

5 solutions

#1


2  

Here's a solution that applies a function within each chr value using data.table's by = argument. Isolated 1s (runs of length one) are located using rle(). It should be fast, too.
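
As a small illustration of what rle() returns and how isolated 1s can be located from it, here is a sketch on a made-up vector (x is not part of the question's data):

```r
# Made-up 0/1 vector for illustration only
x <- c(0, 1, 1, 1, 0, 1)

r <- rle(x)
r$lengths                   # run lengths: 1 3 1 1
r$values                    # value of each run: 0 1 0 1

# cumsum(lengths) gives the position at which each run ends
ends <- cumsum(r$lengths)   # 1 4 5 6

# A run of length one with value 1 is an isolated 1; its end is its position
isolated <- ends[r$lengths == 1 & r$values == 1]
isolated                    # 6
```

The last element of x is the only 1 without a neighbouring 1, so it is the only position reported.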

First, here is the data as I input it:

df <- read.table(textConnection( 
"chr       leftPos     OC_030_ST.res OC_031_WG.res
1           4324            0            1
1           23433           1            1
1           34436           1            0
1           64755           1            1
3           234             1            0
3           354             0            1
4           1666            0            1
4           4565            0            1
5           34777           1            1
7           2345            1            1
7           4567            1            1"), header = TRUE)

Then the code to process the data:

# Take a 0/1 vector and turn isolated 1s (runs of length one) into 0s
convertNonRuns <- function(booleanVec) {
    rleVals <- rle(booleanVec)
    # runs that consist of a single 1
    isolatedRuns <- which(rleVals$lengths == 1 & rleVals$values == 1)
    # cumsum(lengths) is the end position of each run;
    # for a run of length one that is the position of the 1 itself
    isolatedPos <- cumsum(rleVals$lengths)[isolatedRuns]
    booleanVec[isolatedPos] <- 0L
    as.integer(booleanVec)
}

library(data.table)
dt <- data.table(df)
# use data.table's by command to convert runs within chr(omosome)
dt[, c("OC_030_ST.res", "OC_031_WG.res") := 
     list(convertNonRuns(OC_030_ST.res), convertNonRuns(OC_031_WG.res)),
      by = chr]
dt
##     chr leftPos OC_030_ST.res OC_031_WG.res
##  1:   1    4324             0             1
##  2:   1   23433             1             1
##  3:   1   34436             1             0
##  4:   1   64755             1             0
##  5:   3     234             0             0
##  6:   3     354             0             0
##  7:   4    1666             0             1
##  8:   4    4565             0             1
##  9:   5   34777             0             0
## 10:   7    2345             1             1
## 11:   7    4567             1             1

Added

For the newly added dput data, this will work:

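# select all variables OC*.res (dt must be a data.table built from the dput data)
varnamesToChange <- grep("^OC.*\\.res$", names(dt), value = TRUE)
# apply convertNonRuns to every selected column, separately within each chromosome
dt[, (varnamesToChange) := lapply(.SD, convertNonRuns), by = chr, .SDcols = varnamesToChange]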
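
For readers without data.table, the same per-column, per-chromosome step can be sketched in base R. This is only an illustration: dropShortRuns and its minLen parameter are hypothetical names introduced here, and the data frame is a cut-down copy of the question's example.

```r
# Hypothetical helper: set runs of 1s shorter than minLen to 0
dropShortRuns <- function(x, minLen = 2L) {
  r <- rle(x)
  r$values[r$values == 1 & r$lengths < minLen] <- 0
  inverse.rle(r)   # rebuild the full-length vector from the edited runs
}

# Cut-down copy of the example data from the question
df <- data.frame(
  chr           = c(1, 1, 1, 1, 3, 3, 4, 4, 5, 7, 7),
  OC_030_ST.res = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1),
  OC_031_WG.res = c(1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1)
)

# Select the sample columns by name, then filter each one within chr groups
cols <- grep("^OC.*\\.res$", names(df), value = TRUE)
df[cols] <- lapply(df[cols], function(x) ave(x, df$chr, FUN = dropShortRuns))
df
```

ave() applies the helper to each chromosome's slice of a column and puts the results back in the original row order, so runs never extend across a chromosome boundary.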

I am using data.table version 1.9.6.

#2


1  

data.table solution, building on my initial ave solution, which is also below:

library(data.table)
setDT(dat)
for (nam in names(dat)[3:4]) {
  dat[, 
    c(nam) := ((length((get(nam)==1)[get(nam)]) >= 2) & get(nam)==1)+0L,
    by=list(chr, cumsum(get(nam)==0))
  ]
}

#    chr leftPos OC_030_ST.res OC_031_WG.res
# 1:   1    4324             0             1
# 2:   1   23433             1             1
# 3:   1   34436             1             0
# 4:   1   64755             1             0
# 5:   3     234             0             0
# 6:   3     354             0             0
# 7:   4    1666             0             1
# 8:   4    4565             0             1
# 9:   5   34777             0             0
#10:   7    2345             1             1
#11:   7    4567             1             1

And my attempt using ave with a custom function:

fun <- function(x,grp,limit=2) { 
  runs <- ave(
    x==1,
    list(grp,cumsum(x==0)),
    FUN=function(g) length(g[g]) >= limit
  ) 
  as.numeric(runs & x==1)
}

lapply(dat[3:4], fun, grp=dat$chr)

#$OC_030_ST.res
# [1] 0 1 1 1 0 0 0 0 0 1 1
#
#$OC_031_WG.res
# [1] 1 1 0 0 0 0 1 1 0 1 1

To overwrite your original data:

dat[3:4] <- lapply(dat[3:4], fun, grp=dat$chr)
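
Both versions above group on cumsum(x == 0). The following sketch, on a made-up vector, shows what that key looks like and why counting the 1s in each group identifies runs of at least two (the variable names are illustrative only):

```r
# Made-up 0/1 vector for illustration only
x <- c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1)

# Every 0 starts a new group, so each group is one 0 followed by a run of 1s
grp <- cumsum(x == 0)
grp                                    # 1 1 1 2 2 3 4 4 4 4

# Number of 1s in the group each element belongs to
onesInGroup <- ave(as.numeric(x == 1), grp, FUN = sum)

# Keep a 1 only if its run contains at least two 1s
keep <- as.numeric(x == 1 & onesInGroup >= 2)
keep                                   # 0 1 1 0 0 0 0 1 1 1
```

Adding the chromosome as a second grouping variable, as the answer does, stops a run from continuing into the next chromosome.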

#3


1  

f0(colNr, df) returns the row numbers in which column df[,colNr] should be set to 0. g(df) returns the converted data frame.

f0 <- function( colNr, df )
{
  col <- df[,colNr]

  n1 <- which( col == 1 )            # The `1`-rows.
  d0 <- which( diff(col) == 0 )      # Consecutive entries are equal.
  dc0 <- which( diff(df[,1]) == 0 )  # Same chromosome.

  m <- intersect( n1-1, intersect( d0, dc0 ) )

  return ( setdiff( 1:nrow(df), union(m,m+1) ) )
}

g <- function( df )
{
  for ( i in 3:ncol(df) ) { df[f0(i,df),i] <- 0 }  
  return ( df )
}
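
To see what the intermediate index vectors in f0 contain, here is a trace of the same steps on a made-up column and chromosome vector (col and chr below are illustrative, not the question's data):

```r
# Made-up data for illustration only
col <- c(0, 1, 1, 0, 1)
chr <- c(1, 1, 1, 1, 2)

n1  <- which(col == 1)          # rows holding a 1: 2 3 5
d0  <- which(diff(col) == 0)    # rows equal to the next row: 2
dc0 <- which(diff(chr) == 0)    # rows on the same chromosome as the next row: 1 2 3

# Rows i where col[i] and col[i + 1] are both 1 and share a chromosome
m <- intersect(n1 - 1, intersect(d0, dc0))
m                               # 2

# Every row that is not part of such a pair gets set to 0
toZero <- setdiff(seq_along(col), union(m, m + 1))
toZero                          # 1 4 5
```

Rows 2 and 3 form the only pair of adjacent 1s, so the isolated 1 in row 5 ends up among the rows to be zeroed.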

Example 1:

> df
   chr leftPos OC_030_ST.res OC_031_WG.res
1    1    4324             0             1
2    1   23433             1             1
3    1   34436             1             0
4    1   64755             1             1
5    3     234             1             0
6    3     354             0             1
7    4    1666             0             1
8    4    4565             0             1
9    5   34777             0             1
10   7    2345             1             1
11   7    4567             1             1
> g(df)
   chr leftPos OC_030_ST.res OC_031_WG.res
1    1    4324             0             1
2    1   23433             1             1
3    1   34436             1             0
4    1   64755             1             0
5    3     234             0             0
6    3     354             0             0
7    4    1666             0             1
8    4    4565             0             1
9    5   34777             0             0
10   7    2345             1             1
11   7    4567             1             1
> 

Example 2:

> df
   chr leftPos OC_030_ST.res OC_031_WG.res
1    1    4324             0             1
2    1   23433             1             1
3    1   34436             1             0
4    1   64755             1             1
5    3     234             1             0
6    3     354             1             1
7    4    1666             0             1
8    4    4565             1             1
9    5   34777             0             0
10   5    1234             1             0
11   7    2345             1             1
12   7    4567             1             1
> g(df)
   chr leftPos OC_030_ST.res OC_031_WG.res
1    1    4324             0             1
2    1   23433             1             1
3    1   34436             1             0
4    1   64755             1             0
5    3     234             1             0
6    3     354             1             0
7    4    1666             0             1
8    4    4565             0             1
9    5   34777             0             0
10   5    1234             0             0
11   7    2345             1             1
12   7    4567             1             1
> 

#4


0  

A simple trick is to compare the original data set, say df, with a copy of itself shifted by one row, df[-1,], which drops the first row.

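Comparing column-wise, e.g. head(df$OC_030_ST.res, -1) == df[-1,]$OC_030_ST.res (and likewise for the other columns), gives a logical vector in which each element is compared with the next one. The last element is dropped from the left-hand side so that both vectors have the same length.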
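
A minimal sketch of this shift-and-compare idea follows. It goes a little further than the answer by assuming that only pairs of 1s on the same chromosome should count; the vectors copy one column and the chr column from the question's example.

```r
# One sample column and the chromosome column from the question's example
x   <- c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)
chr <- c(1, 1, 1, 1, 3, 3, 4, 4, 5, 7, 7)
n   <- length(x)

# Element i describes the pair (row i, row i + 1):
# TRUE when both rows hold a 1 and lie on the same chromosome
pairOfOnes <- x[-n] == 1 & x[-1] == 1 & chr[-n] == chr[-1]

# A row is kept if it is the first or the second member of such a pair
keep <- c(pairOfOnes, FALSE) | c(FALSE, pairOfOnes)
as.numeric(keep)   # 0 1 1 1 0 0 0 0 0 1 1
```

Padding the pair vector with FALSE at either end aligns it with the rows, so no recycling of unequal-length vectors takes place.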

#5


0  

Perhaps you can turn the following snippet into a function and apply it per column, per chromosome:

rand <- c(0,0,0,1,1,1,0,0,1,0,1,0,1,1,1,0,0,1,1,0)

n <- length(rand)
keep <- numeric(n)                             # starts as all zeros
for (i in seq_len(n)) {
  if (rand[i] != 1) next                       # a 0 stays 0
  prevIsOne <- i > 1 && rand[i - 1] == 1       # left neighbour exists and is 1
  nextIsOne <- i < n && rand[i + 1] == 1       # right neighbour exists and is 1
  if (prevIsOne || nextIsOne) keep[i] <- 1     # keep a 1 that has a neighbouring 1
}
keep

Output:

[1] 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0
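
One way the suggestion of wrapping this logic in a function and applying it per chromosome might look is sketched below. keepNeighbours is a hypothetical name, the function is a vectorised restatement of the loop rather than the answerer's own code, and the chr and x vectors copy the question's example.

```r
# Hypothetical helper: keep a 1 only if the previous or next element is also 1
keepNeighbours <- function(v) {
  n <- length(v)
  prevIsOne <- c(FALSE, v[-n] == 1)   # the first element has no left neighbour
  nextIsOne <- c(v[-1] == 1, FALSE)   # the last element has no right neighbour
  as.numeric(v == 1 & (prevIsOne | nextIsOne))
}

# Same test vector as in the answer
rand <- c(0,0,0,1,1,1,0,0,1,0,1,0,1,1,1,0,0,1,1,0)
keepNeighbours(rand)

# Apply per chromosome: split the column by chr, filter each piece, reassemble
chr <- c(1, 1, 1, 1, 3, 3, 4, 4, 5, 7, 7)
x   <- c(1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1)
perChr <- unsplit(lapply(split(x, chr), keepNeighbours), chr)
perChr
```

Because each chromosome is processed as its own vector, a 1 at the end of one chromosome is never treated as a neighbour of a 1 at the start of the next.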
