I asked a question here that is very difficult to tackle: how can I group strings based on their similarity? I found a promising idea and I want to give it a try.
Here is my idea and the data (the same data as in that question):
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
1- I try to count the number of letters in each string (one string per row).
2- I try to compute adist between each pair of strings.
If the output of adist is close to 1, the two strings belong to one group; if not, they are in two different groups.
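The two steps and the grouping rule above could be sketched as follows. This is only a sketch under stated assumptions: the labels are a whitespace-trimmed subset of the data above, the cutoff `max_dist` is a hypothetical value that would need tuning (the question does not fix one), and single-linkage clustering with hclust and cutree is just one possible way to turn pairwise distances into groups.

```r
# A trimmed subset of the labels from the data above
labels <- c("Afghanestan", "Afghanestankabol", "holand", "holandindia", "USA")

# Step 1: number of characters in each string
n_chars <- nchar(labels)

# Step 2: edit distance between every pair (symmetric matrix, zero diagonal)
d <- adist(labels)
dimnames(d) <- list(labels, labels)

# Hypothetical cutoff (an assumption): strings connected through a chain of
# pairwise distances <= max_dist end up in the same group
max_dist <- 5
groups <- cutree(hclust(as.dist(d), method = "single"), h = max_dist)

# List the members of each group
split(labels, groups)
```

With this cutoff the two Afghanestan strings share a group, the two holand strings share a group, and USA stays on its own. A different cutoff, or a distance scaled by string length, would give different groups.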
To solve the above, I need to know how to run adist on all the strings in the first column of my data.
So my questions are the following:
1- Is there a function that does the opposite of adist?
2- How can I run adist across all combinations, computing each pair only once and always from the longest string to the shortest? For example:
adist("Afghanestankabolindia","Afghanestan")
adist("Afghanestankabolindia","Afghanestankabol")
adist("Afghanestankabolindia","indiaAfghanestan")
adist("Afghanestankabolindia","Holandnorway")
adist("Afghanestankabolindia","holand")
adist("Afghanestankabolindia","holandindia")
.
.
.
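The repeated calls above can probably be collapsed into a single call, since adist accepts a vector as its second argument. A minimal sketch, assuming whitespace-trimmed labels; the names `ref_string`, `others`, and `res` are only illustrative and not part of the original data:

```r
# Reference string and the strings to compare it against
# (trimmed labels taken from the data above)
ref_string <- "Afghanestankabolindia"
others <- c("Afghanestan", "Afghanestankabol", "indiaAfghanestan",
            "Holandnorway", "holand", "holandindia")

# adist is vectorised: one call returns a 1 x length(others) matrix
d <- adist(ref_string, others)

# Flatten to a named vector for easier reading
res <- setNames(as.vector(d), others)
res
```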
The tricky part is that each distance should be calculated only once, between the reference and the other string. For example, it should calculate the distance between
Afghanestankabolindia and Afghanestan
and not
Afghanestan and Afghanestankabolindia
That is, the reference is always the longest string.
1 Answer
Not really sure what your expected output format is, but I think this does what you want:
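# Take the labels as character strings (they are stored as a factor)
ref = as.character(df$label)
# Sort longest first, then build each pair once; V1 is never shorter than V2
all_combs = as.data.frame(t(combn(ref[order(nchar(ref), decreasing = TRUE)], 2)), stringsAsFactors = FALSE)
# Edit distance for each (V1, V2) pair
all_combs$val = mapply(adist, all_combs$V1, all_combs$V2)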
First, we create all combinations, sorting the ref vector by length so that the first element of each pair is always the longer one (i.e. the reference). Then we use mapply to calculate adist for each combination.
Output:
V1 V2 val
1 Afghanestankabolindia USAargentinabrazil 15
2 Afghanestankabolindia indiaAfghanestan 15
3 Afghanestankabolindia Afghanestankabol 5
4 Afghanestankabolindia Holandnorway 17
5 Afghanestankabolindia USAargentina 17
6 Afghanestankabolindia Afghanestan 10
7 Afghanestankabolindia holandindia 13
8 Afghanestankabolindia holand 16
9 Afghanestankabolindia USA 21
10 USAargentinabrazil indiaAfghanestan 16
11 USAargentinabrazil Afghanestankabol 13
12 USAargentinabrazil Holandnorway 14
13 USAargentinabrazil USAargentina 7
14 USAargentinabrazil Afghanestan 15
15 USAargentinabrazil holandindia 13
16 USAargentinabrazil holand 16
17 USAargentinabrazil USA 16
18 indiaAfghanestan Afghanestankabol 10
19 indiaAfghanestan Holandnorway 14
... ..... ..... ..
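As a possible alternative to the combn/mapply approach, adist can also be called once on the whole vector, which returns the full distance matrix; the part above the diagonal then holds each pair exactly once. The sketch below assumes a small, whitespace-trimmed subset of the labels, and the names `labels`, `idx`, and `pairs` are only illustrative.

```r
# A trimmed subset of the labels, sorted from longest to shortest
labels <- c("Afghanestan", "Afghanestankabolindia", "holand", "Afghanestankabol")
labels <- labels[order(nchar(labels), decreasing = TRUE)]

# One call gives the full symmetric distance matrix
d <- adist(labels)

# Keep each pair once: positions above the diagonal (row index < column index).
# Because labels is sorted longest first, V1 is never shorter than V2.
idx <- which(upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(V1 = labels[idx[, "row"]],
                    V2 = labels[idx[, "col"]],
                    val = d[idx],
                    stringsAsFactors = FALSE)
pairs
```

This avoids one adist call per pair, which may matter for longer vectors, at the cost of also computing the mirrored half of the matrix.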
Hope this helps!