Managing duplicate rows in a large data frame

I would like to tag samples (sample_id column) that have more than one State within the same no value with the string E.

My df dataframe input:

          no               sample_id  State 
chr1-15984544-15996851-0n  NE001788    0n
chr1-15984544-15996851-0n  NE001788    1n
chr1-15984544-15996851-0n  NE001836    0n
chr1-15984544-15996851-0n  NE002026    0n
chr1-15984544-15996851-0n  NE001413    0n
chr1-15984544-15996851-0n  NE001438    0n

My expected output:

          no               sample_id  State 
chr1-15984544-15996851-0n  NE001788    E
chr1-15984544-15996851-0n  NE001836    0n
chr1-15984544-15996851-0n  NE002026    0n
chr1-15984544-15996851-0n  NE001413    0n
chr1-15984544-15996851-0n  NE001438    0n

The sample NE001788 was tagged with E because it has two different states (State) for the same no string. I used the code below on small data frames:

df <- read.table(text= 'no  sample_id  State 
                 chr1-15984544-15996851-0n  NE001788    0n
                 chr1-15984544-15996851-0n  NE001788    1n
                 chr1-15984544-15996851-0n  NE001836    0n
                 chr1-15984544-15996851-0n  NE002026    0n
                 chr1-15984544-15996851-0n  NE001413    0n
                 chr1-15984544-15996851-0n  NE001438    0n',header=TRUE) 

library(plyr)
output <- unique(ddply(df,.(no,sample_id),mutate,State=if(length(unique(State))>1) {"E"} else State))

It works fine. However, I now have a large data frame (more than 700k rows). On this large data frame I get a memory error: cannot allocate vector of size 75kb.

I am looking for alternatives that reach the same result without running out of memory.

Thank you very much.

2 solutions

#1

Try data.table. I didn't benchmark this code, but it should certainly do better than plyr:

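library(data.table)
# Group by no and sample_id; any group with more than one row (.N > 1) collapses to a single "E" row
df <- setDT(df)[, lapply(.SD, function(x) ifelse(.N > 1, "E", as.character(x))), by = c("no", "sample_id"), .SDcols = "State"]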

##                           no sample_id State
## 1: chr1-15984544-15996851-0n  NE001788     E
## 2: chr1-15984544-15996851-0n  NE001836    0n
## 3: chr1-15984544-15996851-0n  NE002026    0n
## 4: chr1-15984544-15996851-0n  NE001413    0n
## 5: chr1-15984544-15996851-0n  NE001438    0n

A better option is to first convert State to character (if it isn't already), so that as.character is not called in each group, and then do the subsetting. Something like:

setDT(df)[, State := as.character(State)]
df <- df[, lapply(.SD, function(x) ifelse(.N > 1, "E", x)), by = c("no", "sample_id"), .SDcols = "State"]
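
Both calls above test .N > 1, so a group is tagged whenever it has more than one row, even if all its rows carry the same State. If you only want to tag groups with more than one distinct State (as the plyr code in the question does), a minimal sketch could use data.table's uniqueN() instead. The names dt and res are only illustrative, and the sample data is rebuilt here so the snippet runs on its own:

```r
library(data.table)

# Sample data from the question, with State stored as character
dt <- data.table(
  no        = rep("chr1-15984544-15996851-0n", 6),
  sample_id = c("NE001788", "NE001788", "NE001836",
                "NE002026", "NE001413", "NE001438"),
  State     = c("0n", "1n", "0n", "0n", "0n", "0n")
)

# One row per (no, sample_id): "E" when the group holds more than one
# distinct State, otherwise the group's single State value
res <- dt[, .(State = if (uniqueN(State) > 1L) "E" else State[1L]),
          by = .(no, sample_id)]
```

Because each group is reduced to a single row inside the grouped call, no separate unique() pass over the full result is needed afterwards.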

#2

And here's the dplyr code to do it:

library(dplyr)

# df is the data frame from the question
df %>%
  mutate(State = as.character(State)) %>%
  group_by(no, sample_id) %>%
  summarize(State = ifelse(length(unique(State)) > 1, "E", State))

Most likely, using dplyr will be faster than plyr, but I don't know how it compares in terms of memory usage, which seems to be the bottleneck in your case.

Note that I convert State to character before the operation because if you read in the data as shown in the question, State will be a factor. If in reality it is already a character column, you can of course skip that step.

Note: I use length(unique(State)) > 1 to cover the (hypothetical) case where entries in no, sample_id and State are all the same in multiple rows. Based on your description you wouldn't want to assign E to State in that case, but it's not clear if such a case is possible at all in your data. If not, you could replace length(unique(State)) > 1 with n() > 1.
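
To make the difference between the two conditions concrete, here is a small sketch on made-up data in which NE001836 appears twice with the same State. The data frame toy and the result names by_distinct and by_count are hypothetical, and n_distinct(State) is used as a shorthand for length(unique(State)):

```r
library(dplyr)

# Hypothetical data: NE001788 has two different states,
# NE001836 is a fully duplicated row (same no, sample_id and State)
toy <- data.frame(
  no        = rep("chr1-15984544-15996851-0n", 4),
  sample_id = c("NE001788", "NE001788", "NE001836", "NE001836"),
  State     = c("0n", "1n", "0n", "0n"),
  stringsAsFactors = FALSE
)

# Tag only groups with more than one distinct State
by_distinct <- toy %>%
  group_by(no, sample_id) %>%
  summarise(State = if (n_distinct(State) > 1) "E" else State[1]) %>%
  ungroup()

# Tag every group with more than one row, duplicates included
by_count <- toy %>%
  group_by(no, sample_id) %>%
  summarise(State = if (n() > 1) "E" else State[1]) %>%
  ungroup()
```

With this input, by_distinct keeps 0n for NE001836 while by_count tags it with E, so n() > 1 is only a safe replacement when exact duplicate rows cannot occur.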
