How to perform a function once across all pairs

I asked a question here that is very difficult to tackle: how can I group rows based on similarity in strings? I found a promising idea and I want to give it a try.

Here are my idea and my data (the same data as in that question):

df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L, 
    9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway", 
    " USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia", 
    "indiaAfghanestan ", "USA", "USAargentina "), class = "factor"), 
        value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L, 
        8L), .Label = c("1941029507", "2367321518", "2849255881", 
        "2913128511", "2927576083", "4550996370", "457707181.9", 
        "637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label", 
    "value"), class = "data.frame", row.names = c(NA, -10L))

1- I calculate the number of letters in each string in each row
2- I run adist between each pair of strings

If the output of adist is similar to the value from step 1, the two strings belong to one group; if not, they are in two different groups.

To solve the above, I need to know how to run adist on all strings in the first column of my data.

So my questions are the following:

1- Is there a function that does the opposite of adist?
2- How can I run adist across all combinations, only once per pair and always from the longest string to the shortest? For example:

adist("Afghanestankabolindia","Afghanestan")
adist("Afghanestankabolindia","Afghanestankabol")
adist("Afghanestankabolindia","indiaAfghanestan")
adist("Afghanestankabolindia","Holandnorway")
adist("Afghanestankabolindia","holand")
adist("Afghanestankabolindia","holandindia")
.
.
.

The tricky part is that each pair should be evaluated only once, with the reference string first. For example, it should calculate only the distance between

Afghanestankabolindia and Afghanestan

and not

Afghanestan and Afghanestankabolindia

In other words, the reference is always the longer string of the pair.

1 Answer

#1

I'm not really sure what your expected output format is, but I think this does what you want:

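# Sort the labels by length (longest first) so the first element of each pair is the reference
ref = as.character(df$label)
all_combs = as.data.frame(t(combn(ref[order(nchar(ref), decreasing = TRUE)], 2)), stringsAsFactors = FALSE)
# One adist call per pair
all_combs$val = mapply(adist, all_combs$V1, all_combs$V2)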

First, we create all combinations, sorting the ref vector by length so that the first element of each pair is always the longer one (i.e. the reference). Then we use mapply to calculate adist for each combination.
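As a possible alternative to the combn/mapply approach, adist called with a single character vector returns the full matrix of pairwise distances, so each unordered pair can be read once from the upper triangle. This is only a sketch: the `labels` vector is a hypothetical, whitespace-trimmed subset of the data, and the names `pairs`, `reference` and `other` are made up for illustration.

```r
# Hypothetical, trimmed subset of the labels (an assumption, not the original df)
labels <- c("Afghanestan", "Afghanestankabol", "Afghanestankabolindia",
            "holand", "holandindia")

# Sort longest first, so an earlier string is never shorter than a later one
ref <- labels[order(nchar(labels), decreasing = TRUE)]

# With one argument, adist() compares every string against every other string
d <- adist(ref)

# Upper triangle (row < col) holds each unordered pair exactly once,
# with the longer (or equally long) string in the row position
idx <- which(upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(
  reference = ref[idx[, "row"]],
  other     = ref[idx[, "col"]],
  val       = d[idx],
  stringsAsFactors = FALSE
)
```

The distance matrix is symmetric, so dropping the lower triangle loses no information. It only enforces the "longest string is the reference" rule.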

Output:

                      V1                  V2 val
1  Afghanestankabolindia  USAargentinabrazil  15
2  Afghanestankabolindia   indiaAfghanestan   15
3  Afghanestankabolindia    Afghanestankabol   5
4  Afghanestankabolindia        Holandnorway  17
5  Afghanestankabolindia       USAargentina   17
6  Afghanestankabolindia        Afghanestan   10
7  Afghanestankabolindia         holandindia  13
8  Afghanestankabolindia              holand  16
9  Afghanestankabolindia                 USA  21
10    USAargentinabrazil   indiaAfghanestan   16
11    USAargentinabrazil    Afghanestankabol  13
12    USAargentinabrazil        Holandnorway  14
13    USAargentinabrazil       USAargentina    7
14    USAargentinabrazil        Afghanestan   15
15    USAargentinabrazil         holandindia  13
16    USAargentinabrazil              holand  16
17    USAargentinabrazil                 USA  16
18     indiaAfghanestan     Afghanestankabol  10
19     indiaAfghanestan         Holandnorway  14
...               .....                .....  ..
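One possible reading of the grouping idea in the question is to compare each distance with the difference in string lengths. With adist's default unit costs, the distance equals the length difference exactly when the shorter string can be obtained from the longer one by deletions alone. The sketch below assumes that reading. The small `pairs` table and the columns `len_diff` and `same_group` are hypothetical and are not part of the answer's code.

```r
# Hypothetical pairs in the same shape as all_combs (V1 is the longer string)
pairs <- data.frame(
  V1 = c("Afghanestankabolindia", "Afghanestankabolindia", "Afghanestankabolindia"),
  V2 = c("Afghanestankabol", "Afghanestan", "holandindia"),
  stringsAsFactors = FALSE
)

# Edit distance for each pair; as.vector() drops the 1x1 matrix that adist returns
pairs$val <- mapply(function(a, b) as.vector(adist(a, b)),
                    pairs$V1, pairs$V2, USE.NAMES = FALSE)

# Difference in number of characters between reference and other string
pairs$len_diff <- nchar(pairs$V1) - nchar(pairs$V2)

# Assumed grouping rule: same group when the distance is fully explained
# by the length difference (only deletions are needed)
pairs$same_group <- pairs$val == pairs$len_diff
```

A looser rule, such as allowing the distance to exceed the length difference by a small tolerance, might suit noisy labels better. That threshold would be a design choice rather than something the question specifies.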

Hope this helps!
