Data cleaning and misspelled words in a table

I have this CSV dataset and I need to write a function to clean it, but my code still isn't working and I'm running out of ideas.

Here is the dataset on Google Drive.

Here is what I need to do:

Correcting possible typos

Removing irrelevant data (only houses in Auckland and Wellington are considered)

Removing outliers, e.g. negative area, negative power consumption, very high area, very high power consumption

So far this is the code I have written:

# Reading data set
installed.packages("lubridate")
library(lubridate)

# Reading data set
power <- read.csv("data set 6.csv", na.strings="")

# SUBSETTING
Area <- as.numeric(power$Area)
City <- as.character(power$City)
P.Winter <- as.numeric(power$P.Winter)
P.Summer <- as.numeric(power$P.Summer)

#Data Cleaning
levels(power$City) <- c(levels(power$City), "Auckland")
power$City[power$City == "Ackland"] <- "Auckland"

#Removing irrelevant data (only houses in Auckland and Wellington are considered)
power$City <- power$City[-c(496,499), ]

After I run this code, the misspelled word ("Ackland") does not change to "Auckland" as I expected. The highlighted row in this image is supposed to change to Auckland:

1 Solution

#1

To address your issue of collapsing the factor levels 'Ackland' and 'Auckland' (assuming you want power$City to be, or remain, a factor):

One method is to assign a named list to levels(). Each name is the correct label for a desired level (in your case, the correct city names in your data set), and each element lists the existing levels to merge under that label. See "Cleaning up factor levels (collapsing multiple levels/labels)" for a general example.

However, as a heads up, watch for the trailing space after the Ackland and Auckland character strings in your data set. That trailing space is why the comparison power$City == "Ackland" in your code never matches:

    # First check the column classes to confirm power$City is a factor
    > sapply(power, class)     # is.factor(power$City) works too
         Area      City  P.Winter  P.Summer 
    "numeric"  "factor" "numeric" "numeric" 

    # Notice the trailing spaces in "Ackland " and "Auckland "
    > levels(power$City)
    [1] "Ackland "   "Auckland "  "Sydney"     "Wellington"

Passing a named list to levels() works once you account for the spaces:

    levels(power$City) <- list(Auckland = c("Ackland ", "Auckland "),
                               Sydney = "Sydney",
                               Wellington = "Wellington")

    # Now only three factor levels (this also removed the trailing spaces)
    > levels(power$City)
    [1] "Auckland"   "Sydney"     "Wellington"

You now have 3 levels instead of 4. Notice that this also took care of the spaces in the level labels.
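
If City turns out to be a character column rather than a factor (read.csv() stopped converting strings to factors by default in R 4.0.0), the same cleanup can be done on the strings directly. The following is only a sketch on made-up rows, not your actual file: it trims whitespace with trimws(), fixes the one known typo, and converts to a factor at the end.

```r
# Made-up rows standing in for the real CSV (values are for illustration only)
power <- data.frame(
  City = c("Ackland ", "Auckland ", "Sydney", "Wellington"),
  Area = c(120, 95, 150, 80),
  stringsAsFactors = FALSE
)

# Strip leading/trailing whitespace first, so exact comparisons can match
power$City <- trimws(power$City)

# Fix the known typo
power$City[power$City == "Ackland"] <- "Auckland"

# Convert to a factor if the rest of the analysis expects one
power$City <- factor(power$City)
levels(power$City)
```

Trimming first means the typo fix no longer depends on knowing exactly how many stray spaces each value carries.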

Subset to include only the relevant cities:

    # %in% tests membership; == would recycle the vector and miss rows
    subpower <- power[power$City %in% c("Auckland", "Wellington"), ]
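
One thing to be aware of after subsetting a factor: the unused level ("Sydney") stays in levels() until it is dropped. Here is a minimal sketch on made-up rows, using %in% for the membership test and droplevels() to discard the empty level; the data frame is only a stand-in for your data.

```r
# Made-up rows for illustration; the real data comes from the CSV
power <- data.frame(
  City = factor(c("Auckland", "Sydney", "Wellington", "Auckland")),
  Area = c(120, 150, 80, 95)
)

# %in% tests membership element by element; == against a length-2 vector
# would recycle that vector and compare position by position instead
keep <- power$City %in% c("Auckland", "Wellington")

# droplevels() removes factor levels that no longer occur in the subset
subpower <- droplevels(power[keep, ])

levels(subpower$City)
nrow(subpower)
```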

You could also subset to exclude negative values, extreme values, etc...
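
As a rough sketch of that kind of filter, assuming the column names from the question (Area, P.Winter, P.Summer): the upper bounds max_area and max_power below are hypothetical placeholders, so pick real limits from the data or from domain knowledge.

```r
# Made-up rows for illustration; column names follow the question
power <- data.frame(
  Area     = c(120, -5, 95, 10000, 80, 110),
  P.Winter = c(30, 25, -2, 40, 28, 35),
  P.Summer = c(20, 18, 15, 22, 19, 900)
)

# Hypothetical upper bounds, not taken from the original data set
max_area  <- 1000
max_power <- 100

# TRUE only for rows where every value is within its plausible range
ok <- with(power,
  Area > 0 & Area <= max_area &
  P.Winter >= 0 & P.Winter <= max_power &
  P.Summer >= 0 & P.Summer <= max_power
)

# which() keeps rows where the test is TRUE, dropping FALSE and NA alike
clean <- power[which(ok), ]
nrow(clean)
```

Fixed cut-offs are the simplest option; a rule based on quantiles or the interquartile range would be another reasonable choice if you have no natural limits in mind.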

Note: My only real contribution here is catching the extra spaces. When tackling similar problems myself, I found Aaron's answer very helpful. Hope this helps!
