I have a large data set on biological journals that was compiled over a long time by different people, so the data are not in a single format. For example, in the "AUTHOR" column I can find John Smith, Smith John, Smith J and so on, all referring to the same person. I cannot perform even the simplest operations; for example, I can't figure out which authors wrote the most articles.
Is there any way in R to determine whether most of the characters in different names are the same and, if so, treat those names as the same element?
1 solution
There are programs and packages that can help you with this, and some are listed in the comments. But if you don't want to use those, I thought I'd try to write something in R that might help. The code will match "John Smith" with "J Smith", "John Smith", "Smith John" and "John S". Meanwhile, it won't match something like "John Sally". I avoided using percentage similarity, and I explain why further down. Here is the code:
# generate some random names
names = c(
  "John Smith",
  "Wigberht Ernust",
  "Samir Henning",
  "Everette Arron",
  "Erik Conor",
  "Smith J",
  "Smith John",
  "John S",
  "John Sally"
);
# split those names and get both ways to write each name
split_names = lapply(
  X = names,
  FUN = function(x){
    # split by a space
    c_split = unlist(x = strsplit(x = x, split = " "));
    # keep both orderings of c_split to compensate for word order
    c_splits = list(c_split, rev(x = c_split));
    # return c_splits
    c_splits;
  }
);
# suppose we're looking for John Smith
search_for = "John Smith";
# split it by " " and then keep both ways to write that name
search_for_split = unlist(x = strsplit(x = search_for, split = " "));
search_for_split = list(search_for_split, rev(x = search_for_split));
# initialise a vector recording whether search_for was matched in names
match_statuses = c();
# for each name that's been split
for(i in seq_along(names)){
  # the match status for the current name
  match_status = FALSE;
  # the current split name (both orderings)
  c_split_name = split_names[[i]];
  # for each ordering of search_for_split
  for(j in seq_along(search_for_split)){
    # the current ordering of the name we are searching for
    c_search_for_split_names = search_for_split[[j]];
    # for each ordering of c_split_name
    for(k in seq_along(c_split_name)){
      # the current ordering of the current split name
      c_c_split_name = c_split_name[[k]];
      # grep() is a pattern-finding function; a non-empty result means a match
      if(
        (
          # is the first element of c_search_for_split_names found in the
          # first element of c_c_split_name ...
          length(x = grep(pattern = c_search_for_split_names[1], x = c_c_split_name[1])) > 0 &&
          # ... and the second element found in the second element?
          length(x = grep(pattern = c_search_for_split_names[2], x = c_c_split_name[2])) > 0
        ) || (
          # or, is the first element of c_c_split_name found in the first
          # element of c_search_for_split_names ...
          length(x = grep(pattern = c_c_split_name[1], x = c_search_for_split_names[1])) > 0 &&
          # ... and the second element found in the second element?
          length(x = grep(pattern = c_c_split_name[2], x = c_search_for_split_names[2])) > 0
        )
      ){
        # if this is the case, update match status to TRUE
        match_status = TRUE;
      }
    }
  }
  # append match_status to the match_statuses vector
  match_statuses = c(match_statuses, match_status);
}
search_for;
[1] "John Smith"
cbind(names, match_statuses);
     names             match_statuses
[1,] "John Smith"      "TRUE"
[2,] "Wigberht Ernust" "FALSE"
[3,] "Samir Henning"   "FALSE"
[4,] "Everette Arron"  "FALSE"
[5,] "Erik Conor"      "FALSE"
[6,] "Smith J"         "TRUE"
[7,] "Smith John"      "TRUE"
[8,] "John S"          "TRUE"
[9,] "John Sally"      "FALSE"
Hopefully this code can serve as a starting point, and you may wish to adjust it to work with names of arbitrary length (instead of just two).
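One possible way to generalise this to names with any number of parts is sketched below. It is not the code above: it swaps grep() for prefix matching with startsWith(), ignores case and punctuation, and assumes that two names can only match when they have the same number of parts. The helpers split_name(), parts_compatible(), pair_parts() and names_match() are hypothetical names made up for this sketch.

```r
# Split a name into lower-case parts, treating punctuation such as "." or "," as spaces.
split_name <- function(name) {
  parts <- strsplit(tolower(gsub("[[:punct:]]", " ", name)), "[[:space:]]+")[[1]]
  parts[nzchar(parts)]
}

# TRUE if one part is a prefix of the other, e.g. "j" vs "john" or "smith" vs "smith".
parts_compatible <- function(a, b) {
  startsWith(a, b) || startsWith(b, a)
}

# TRUE if every part in pa can be paired with a distinct compatible part in pb.
# Backtracking is used so that the order of the parts does not matter.
pair_parts <- function(pa, pb) {
  if (length(pa) == 0) return(TRUE)
  for (i in seq_along(pb)) {
    if (parts_compatible(pa[1], pb[i]) && pair_parts(pa[-1], pb[-i])) {
      return(TRUE)
    }
  }
  FALSE
}

# Assumption: names only match when they have the same number of parts.
names_match <- function(a, b) {
  pa <- split_name(a)
  pb <- split_name(b)
  if (length(pa) == 0 || length(pa) != length(pb)) return(FALSE)
  pair_parts(pa, pb)
}

author_names <- c(
  "John Smith", "Wigberht Ernust", "Samir Henning", "Everette Arron",
  "Erik Conor", "Smith J", "Smith John", "John S", "John Sally"
)

# Compare every name against the one we are searching for.
matched <- vapply(author_names, names_match, logical(1),
                  b = "John Smith", USE.NAMES = FALSE)
print(data.frame(author_names, matched))
```

Because the comparison is wrapped in a function, searching for a different author only means changing the b argument. Note that prefix matching is stricter than grep(), which accepts a match anywhere inside a word.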
Some notes:
- I chose to avoid character-level similarity scores. Percentage cutoffs may lead to a large number of false positives and false negatives, and choosing the cutoff can be subjective. That is not to say they are bad!
- You may wish to wrap this in a function. Then you can apply it to different names by changing search_for.
- There are some time complexity issues with this example, and depending on the size of your data, you may want or need to run it in parallel, or rework it.
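To make the first note more concrete, here is a small sketch of what a percentage-similarity score might look like, using adist() from the utils package that ships with R. The similarity() helper and the normalisation (one minus the edit distance divided by the longer name's length) are assumptions chosen for illustration; other definitions are possible.

```r
# Hypothetical helper: similarity in [0, 1] derived from the Levenshtein
# edit distance, normalised by the length of the longer string.
similarity <- function(a, b) {
  distance <- as.vector(utils::adist(a, b))
  1 - distance / pmax(nchar(a), nchar(b))
}

target <- "John Smith"
candidates <- c("Smith J", "John S", "John Sally")

scores <- similarity(target, candidates)
names(scores) <- candidates
print(round(scores, 2))
```

With this definition, "John Sally" scores higher than "Smith J", so any cutoff loose enough to accept "Smith J" also accepts "John Sally". That is the kind of false positive the first note has in mind.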
Best wishes,
Josh