I have a csv Document with 2 columns which contains Commodity Category and Commodity Name.
我有一个包含两个列的csv文档,其中包含商品类别和商品名称。
Ex:
例:
Sl.No. Commodity Category Commodity Name
1 Stationary Pencil
2 Stationary Pen
3 Stationary Marker
4 Office Utensils Chair
5 Office Utensils Drawer
6 Hardware Monitor
7 Hardware CPU
and I have another csv file which contains various Commodity names.
我还有一个csv文件,它包含各种商品名。
Ex:
例:
Sl.No. Commodity Name
1 Pancil
2 Pencil-HB 02
3 Pencil-Apsara
4 Pancil-Nataraj
5 Pen-Parker
6 Pen-Reynolds
7 Monitor-X001RL
The output I would like is to standardise and categorise the commodity names and classify them into respective Commodity Categories like shown below :
我想要的输出是对商品名称进行标准化和分类,并将其分类为如下所示的商品类别:
Sl.No. Commodity Name Commodity Category
1 Pencil Stationary
2 Pencil Stationary
3 Pencil Stationary
4 Pancil Stationary
5 Pen Stationary
6 Pen Stationary
7 Monitor Hardware
Step 1) I first have to use NLTK (Text mining methods) and clean the data so as to seperate "Pencil" from "Pencil-HB 02" .
第一步)首先我需要使用NLTK(文本挖掘方法)来清理数据,将“Pencil”与“Pencil- hb02”分开。
Step 2) After cleaning I have to use Approximate String match technique i.e agrep() to match the patterns "Pencil *" or correcting "Pancil" to "Pencil".
清洗后,我必须使用近似的字符串匹配技术I。e agrep()匹配模式“铅笔*”或更正“Pancil”到“铅笔”。
Step 3)Once correcting the pattern I have to categorise. No idea how.
第3步)一旦纠正了模式,我就必须进行分类。不知道如何。
This is what I have thought about. I started with step 2 and I'm stuck in step 2 only. I'm not finding an exact method to code this. Is there any way to get the output as required? If yes please suggest me the method I can proceed with.
这就是我所想的。我从第2步开始,只停留在第2步。我找不到一个准确的方法来编码这个。是否有办法得到所需的输出?如果可以,请建议我可以进行的方法。
2 个解决方案
#1
3
You could use the stringdist
package. The correct
function below will correct the Commodity.Name
in file2 based on distances of the item to different CName
.
您可以使用stringdist包。以下正确的功能将会修正商品。根据项目到不同CName的距离在file2中命名。
Then a left_join
is used to join the two tables.
然后使用left_join来连接两个表。
I also notice that there are some classifications if I use the default options for stringdistmatrix
. You can try changing the weight
argument of stringdistmatrix
for better correction result.
我还注意到,如果我使用stringdistmatrix的默认选项,会有一些分类。您可以尝试更改stringdistmatrix的权值参数,以获得更好的校正结果。
> library(dplyr)
> library(stringdist)
>
> file1 <- read.csv("/Users/Randy/Desktop/file1.csv")
> file2 <- read.csv("/Users/Randy/Desktop/file2.csv")
>
> head(file1)
Sl.No. Commodity.Category Commodity.Name
1 1 Stationary Pencil
2 2 Stationary Pen
3 3 Stationary Marker
4 4 Office Utensils Chair
5 5 Office Utensils Drawer
6 6 Hardware Monitor
> head(file2)
Sl.No. Commodity.Name
1 1 Pancil
2 2 Pencil-HB 02
3 3 Pencil-Apsara
4 4 Pancil-Nataraj
5 5 Pen-Parker
6 6 Pen-Reynolds
>
> CName <- levels(file1$Commodity.Name)
> correct <- function(x){
+ factor(sapply(x, function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))]), CName)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> correctedfile2 %>%
+ inner_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Pancil Pencil Stationary
2 Pencil-HB 02 Pencil Stationary
3 Pencil-Apsara Pencil Stationary
4 Pancil-Nataraj Pencil Stationary
5 Pen-Parker Pen Stationary
6 Pen-Reynolds Pen Stationary
7 Monitor-X001RL Monitor Hardware
If you need the "Others" category, you just need to play with the weights. I added a row "Diesel" in file2. Then compute the score using stringdist
with customized weights (you should try varying the values). If the score is large than 2 (this value is related to how the weights are assigned), it doesn't correct anything.
如果你需要“其他”类别,你只需要使用权重。我在文件2中添加了一行“Diesel”。然后使用stringdist和定制的权重来计算分数(你应该尝试改变值)。如果分数大于2(这个值与权重分配的方式有关),则不会纠正任何错误。
PS: as we don't know all the possible labels, we have to do as.character
to convect factor
to character
.
PS:因为我们不知道所有可能的标签,所以我们必须做。字符对字符的对流因子。
PS2: I am also using tolower
for case insensitive scoring.
PS2:我也在使用tolower来进行不区分大小写的评分。
> head(file2)
Sl.No. Commodity.Name
1 1 Diesel
2 2 Pancil
3 3 Pencil-HB 02
4 4 Pencil-Apsara
5 5 Pancil-Nataraj
6 6 Pen-Parker
>
> CName <- levels(file1$Commodity.Name)
> CName.lower <- tolower(CName)
> correct_1 <- function(x){
+ scores = stringdistmatrix(tolower(x), CName.lower, weight=c(1,0.001,1,0.5))
+ if (min(scores)>2) {
+ return(x)
+ } else {
+ return(as.character(CName[which.min(scores)]))
+ }
+ }
> correct <- function(x) {
+ sapply(as.character(x), correct_1)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> file1$Commodity.Name = as.character(file1$Commodity.Name)
> correctedfile2 %>%
+ left_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Diesel Diesel <NA>
2 Pancil Pencil Stationary
3 Pencil-HB 02 Pencil Stationary
4 Pencil-Apsara Pencil Stationary
5 Pancil-Nataraj Pencil Stationary
6 Pen-Parker Pen Stationary
7 Pen-Reynolds Pen Stationary
8 Monitor-X001RL Monitor Hardware
#2
0
There is an 'Approximate string matching' function amatch()
in {stingdist}
(at least in 0.9.4.6) that returns the most probable match from the pre-defined set of words. It has a parameter maxDist
that can be set for the maximum distance to be matched, and a nomatch
parameter that can be used for the 'other' category. Otherwise, method, weights, etc. can be set similarly to stringdistmatrix()
.
{stingdist}中有一个“近似字符串匹配”函数amatch()(至少在0.9.4.6中),它从预定义的单词集返回最可能的匹配。它有一个参数maxDist,可以设置为要匹配的最大距离,以及一个nomatch参数,可以用于“其他”类别。否则,可以像stringdistmatrix()那样设置方法、权重等。
So, your original problem can be solved like this using a tidyverse compatible solution:
所以,你最初的问题可以用一个tidyverse兼容的解决方案来解决:
library(dplyr)
library(stringdist)
# Reading the files
file1 <- readr::read_csv("file1.csv")
file2 <- readr::read_csv("file2.csv")
# Getting the commodity names in a vector
commodities <- file1 %>% distinct(`Commodity Name`) %>% pull()
# Finding the closest string match of the commodities, and joining the file containing the categories
file2 %>%
mutate(`Commodity Name` = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)]) %>%
left_join(file1, by = "Commodity Name")
This will return a data frame that contains the corrected commodity name and category. If the original Commodity name
is more than 5 characters away (simplified explanation of string distance) from any of the possible commodity names, the corrected name will be NA.
这将返回一个数据框架,其中包含已修改的商品名称和类别。如果原始商品名称与任何可能的商品名称之间的距离超过5个字符(字符串距离的简化解释),则正确的名称为NA。
#1
3
You could use the stringdist
package. The correct
function below will correct the Commodity.Name
in file2 based on distances of the item to different CName
.
您可以使用stringdist包。以下正确的功能将会修正商品。根据项目到不同CName的距离在file2中命名。
Then a left_join
is used to join the two tables.
然后使用left_join来连接两个表。
I also notice that there are some classifications if I use the default options for stringdistmatrix
. You can try changing the weight
argument of stringdistmatrix
for better correction result.
我还注意到,如果我使用stringdistmatrix的默认选项,会有一些分类。您可以尝试更改stringdistmatrix的权值参数,以获得更好的校正结果。
> library(dplyr)
> library(stringdist)
>
> file1 <- read.csv("/Users/Randy/Desktop/file1.csv")
> file2 <- read.csv("/Users/Randy/Desktop/file2.csv")
>
> head(file1)
Sl.No. Commodity.Category Commodity.Name
1 1 Stationary Pencil
2 2 Stationary Pen
3 3 Stationary Marker
4 4 Office Utensils Chair
5 5 Office Utensils Drawer
6 6 Hardware Monitor
> head(file2)
Sl.No. Commodity.Name
1 1 Pancil
2 2 Pencil-HB 02
3 3 Pencil-Apsara
4 4 Pancil-Nataraj
5 5 Pen-Parker
6 6 Pen-Reynolds
>
> CName <- levels(file1$Commodity.Name)
> correct <- function(x){
+ factor(sapply(x, function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))]), CName)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> correctedfile2 %>%
+ inner_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Pancil Pencil Stationary
2 Pencil-HB 02 Pencil Stationary
3 Pencil-Apsara Pencil Stationary
4 Pancil-Nataraj Pencil Stationary
5 Pen-Parker Pen Stationary
6 Pen-Reynolds Pen Stationary
7 Monitor-X001RL Monitor Hardware
If you need the "Others" category, you just need to play with the weights. I added a row "Diesel" in file2. Then compute the score using stringdist
with customized weights (you should try varying the values). If the score is large than 2 (this value is related to how the weights are assigned), it doesn't correct anything.
如果你需要“其他”类别,你只需要使用权重。我在文件2中添加了一行“Diesel”。然后使用stringdist和定制的权重来计算分数(你应该尝试改变值)。如果分数大于2(这个值与权重分配的方式有关),则不会纠正任何错误。
PS: as we don't know all the possible labels, we have to do as.character
to convect factor
to character
.
PS:因为我们不知道所有可能的标签,所以我们必须做。字符对字符的对流因子。
PS2: I am also using tolower
for case insensitive scoring.
PS2:我也在使用tolower来进行不区分大小写的评分。
> head(file2)
Sl.No. Commodity.Name
1 1 Diesel
2 2 Pancil
3 3 Pencil-HB 02
4 4 Pencil-Apsara
5 5 Pancil-Nataraj
6 6 Pen-Parker
>
> CName <- levels(file1$Commodity.Name)
> CName.lower <- tolower(CName)
> correct_1 <- function(x){
+ scores = stringdistmatrix(tolower(x), CName.lower, weight=c(1,0.001,1,0.5))
+ if (min(scores)>2) {
+ return(x)
+ } else {
+ return(as.character(CName[which.min(scores)]))
+ }
+ }
> correct <- function(x) {
+ sapply(as.character(x), correct_1)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> file1$Commodity.Name = as.character(file1$Commodity.Name)
> correctedfile2 %>%
+ left_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Diesel Diesel <NA>
2 Pancil Pencil Stationary
3 Pencil-HB 02 Pencil Stationary
4 Pencil-Apsara Pencil Stationary
5 Pancil-Nataraj Pencil Stationary
6 Pen-Parker Pen Stationary
7 Pen-Reynolds Pen Stationary
8 Monitor-X001RL Monitor Hardware
#2
0
There is an 'Approximate string matching' function amatch()
in {stingdist}
(at least in 0.9.4.6) that returns the most probable match from the pre-defined set of words. It has a parameter maxDist
that can be set for the maximum distance to be matched, and a nomatch
parameter that can be used for the 'other' category. Otherwise, method, weights, etc. can be set similarly to stringdistmatrix()
.
{stingdist}中有一个“近似字符串匹配”函数amatch()(至少在0.9.4.6中),它从预定义的单词集返回最可能的匹配。它有一个参数maxDist,可以设置为要匹配的最大距离,以及一个nomatch参数,可以用于“其他”类别。否则,可以像stringdistmatrix()那样设置方法、权重等。
So, your original problem can be solved like this using a tidyverse compatible solution:
所以,你最初的问题可以用一个tidyverse兼容的解决方案来解决:
library(dplyr)
library(stringdist)
# Reading the files
file1 <- readr::read_csv("file1.csv")
file2 <- readr::read_csv("file2.csv")
# Getting the commodity names in a vector
commodities <- file1 %>% distinct(`Commodity Name`) %>% pull()
# Finding the closest string match of the commodities, and joining the file containing the categories
file2 %>%
mutate(`Commodity Name` = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)]) %>%
left_join(file1, by = "Commodity Name")
This will return a data frame that contains the corrected commodity name and category. If the original Commodity name
is more than 5 characters away (simplified explanation of string distance) from any of the possible commodity names, the corrected name will be NA.
这将返回一个数据框架,其中包含已修改的商品名称和类别。如果原始商品名称与任何可能的商品名称之间的距离超过5个字符(字符串距离的简化解释),则正确的名称为NA。