This is a reworked version of an earlier question that I felt I had asked unclearly. I am checking, row by row, whether columns V1 and V2 have codes in common. Codes within a cell are separated by a forward slash "/". The function below should take one cell from V1 and the cell from V2 on the same row and split each into a vector in which every element is one code. It should then check whether the two vectors share any elements. The codes start out as 4-digit codes. If any 4-digit code appears in both vectors, the function should return 4. If there is no match, the function should truncate every code by one digit and check again. Each time the function reduces the number of digits, the score it returns at the end is reduced as well. I would like the value returned by the function to be written to a column of my choice.
This is my starting data:
df <- structure(list(ID = c(2630611040, 2696102020, 2696526020), V1 = c("7371/3728",
"2834/2833/2836/5122/8731", "3533/3541/3545/5084"), V2 = c("7379",
"3841", "3533/3532/3531/1389/8711")), .Names = c("ID", "V1",
"V2"), class = "data.frame", row.names = c(NA, 3L))
          ID                       V1                       V2
1 2630611040                7371/3728                     7379
2 2696102020 2834/2833/2836/5122/8731                     3841
3 2696526020      3533/3541/3545/5084 3533/3532/3531/1389/8711
And I would like to get this
          ID                       V1                       V2 V3
1 2630611040                7371/3728                     7379  3
2 2696102020 2834/2833/2836/5122/8731                     3841  0
3 2696526020      3533/3541/3545/5084 3533/3532/3531/1389/8711  4
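To illustrate where the 3 in row 1 comes from, here is a minimal base R sketch of the truncate-and-compare step. The variable names `a` and `b` are only for this illustration.

```r
a <- c("7371", "3728")  # codes from V1, row 1
b <- "7379"             # codes from V2, row 1

# No 4-digit code is shared, so this intersection is empty
intersect(a, b)

# After truncating to 3 digits, "737" appears on both sides, hence the score 3
intersect(substr(a, 1, 3), substr(b, 1, 3))
```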
My function is this
library(stringr) # provides str_split()

coderelat<-function(a, b){
  a<-unique(as.integer(unlist(str_split(a, "/")))) # Transforming cells into vectors of codes
  b<-unique(as.integer(unlist(str_split(b, "/"))))
  a<-a[!is.na(a)]
  b<-b[!is.na(b)]
  if (length(a)==0 | length(b)==0) { # Check that both cells are not empty
    ir=NA
    return(ir)
  } else {
    for (i in 3:1){
      diff<-intersect(a, b) # See how many products the shops have in common
      if (length(diff)!=0) { # As soon as you find a commonality, give ir the corresponding score
        ir=i+1
        break
      } else if (i==1 & length(diff)==0) { # If in the last cycle there is still no commonality, put ir=0
        ir=0
        break
      } else { # If there is no commonality and you are not in the last cycle, reduce the number of digits and re-check
        a<- unique(as.integer(substr(as.character(a), 1, i)))
        b<- unique(as.integer(substr(as.character(b), 1, i)))
      }
    }
  }
  return(ir)
}
The function works when I manually supply single cells, but it doesn't work when I write something like this:
df$V4<-coderelat(df$V1, df$V2)
I would really appreciate any help, because I have run out of ideas for making this work.
Many thanks in advance. Riccardo
1 Answer
Here's a solution using data.table.
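get.match <- function(a, b) {
  # Split one cell from each column into its unique codes (kept as strings)
  A <- unique(strsplit(a, "/", fixed=TRUE)[[1]])
  B <- unique(strsplit(b, "/", fixed=TRUE)[[1]])
  # Compare the first i digits, from 4 down to 1; return i at the first match
  for (i in 4:1) if (length(intersect(substr(A,1,i), substr(B,1,i))) > 0) return(i)
  return(0L) # no digits in common
}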
library(data.table)
setDT(df)[,V3:=get.match(V1,V2),by=ID]
df
#            ID                       V1                       V2 V3
# 1: 2630611040                7371/3728                     7379  3
# 2: 2696102020 2834/2833/2836/5122/8731                     3841  0
# 3: 2696526020      3533/3541/3545/5084 3533/3532/3531/1389/8711  4
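A note on why the original call fails: `coderelat(df$V1, df$V2)` receives both columns in full, so `unlist(str_split(...))` pools the codes from every row into one vector and the function returns a single score for the whole table. The grouping `by=ID` avoids this by calling the function once per row, which assumes that ID is unique. If you would rather stay in base R, the same row-by-row call can be sketched with `mapply()`. This is an alternative I am suggesting, not something the question requires, and the data below is copied from the question.

```r
# Example data copied from the question
df <- data.frame(
  ID = c(2630611040, 2696102020, 2696526020),
  V1 = c("7371/3728", "2834/2833/2836/5122/8731", "3533/3541/3545/5084"),
  V2 = c("7379", "3841", "3533/3532/3531/1389/8711"),
  stringsAsFactors = FALSE
)

# Same logic as get.match above; expects exactly one cell from each column
get.match <- function(a, b) {
  A <- unique(strsplit(a, "/", fixed = TRUE)[[1]])
  B <- unique(strsplit(b, "/", fixed = TRUE)[[1]])
  for (i in 4:1) {
    if (length(intersect(substr(A, 1, i), substr(B, 1, i))) > 0) return(i)
  }
  0L
}

# mapply() calls get.match() once for each pair (V1[k], V2[k])
df$V3 <- mapply(get.match, df$V1, df$V2, USE.NAMES = FALSE)
df$V3
# [1] 3 0 4
```

Unlike the data.table version, this does not depend on ID being unique, because the pairing is done by row position.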