I have this CSV dataset and I need to write a function to clean it, but my code is still not working and I am running out of ideas.
Here is the dataset on Google Drive.
Here is what I need to do:
- Correcting possible typos
- Removing irrelevant data (only houses in Auckland and Wellington are considered)
- Removing outliers, e.g. negative area, negative power consumptions, very high areas, very high power consumptions
So far, this is the code I have written:
# Reading data set
installed.packages("lubridate")
library(lubridate)
# Reading data set
power <- read.csv("data set 6.csv", na.strings="")
# SUBSETTING
Area <- as.numeric(power$Area)
City <- as.character(power$City)
P.Winter <- as.numeric(power$P.Winter)
P.Summer <- as.numeric(power$P.Summer)
#Data Cleaning
levels(power$City) <- c(levels(power$City), "Auckland")
power$City[power$City == "Ackland"] <- "Auckland"
#Removing irrelevant data (only houses in Auckland and Wellington are considered)
power$City <- power$City[-c(496,499), ]
After I run this code, the misspelled word ("Ackland") does not change to "Auckland" as I expected. The highlighted row in this image is supposed to change to Auckland:
1 answer
To address your issue of collapsing the factor levels 'Ackland' and 'Auckland' (assuming you want power$City to remain a factor):
One method is to assign a named list to levels(): each name is the correct label for a desired level (in your case, the correct city names in your data set), and each element lists the existing levels to merge under that name. See "Cleaning up factor levels (collapsing multiple levels/labels)" for a general example.
However, as a heads-up, watch for the trailing space after the "Ackland" and "Auckland" strings in your data set:
# First, view the column classes to confirm power$City is a factor
> sapply(power, class)  # is.factor(power$City) will work too
      Area       City   P.Winter   P.Summer
 "numeric"   "factor"  "numeric"  "numeric"
# Notice the trailing spaces in "Ackland " and "Auckland "
> levels(power$City)
[1] "Ackland "   "Auckland "  "Sydney"     "Wellington"
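Trailing spaces are easy to miss in printed output. As a minimal sketch (using a small made-up factor rather than your actual data), comparing each level with its trimws() version flags the affected labels:

```r
# Toy factor standing in for power$City (made-up values, for illustration only)
city <- factor(c("Ackland ", "Auckland ", "Sydney", "Wellington", "Auckland "))

# Levels that differ from their whitespace-trimmed version carry stray spaces
padded <- levels(city)[levels(city) != trimws(levels(city))]
print(padded)
```

Any label returned here needs the space included when you refer to it by name.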
Passing a named list to levels() works once you account for the spaces:
levels(power$City) <- list(Auckland = c("Ackland ", "Auckland "), Sydney = c("Sydney"), Wellington = c("Wellington"))
# Now only three factor levels (notice this also took care of the extra spaces)
> levels(power$City)
[1] "Auckland" "Sydney" "Wellington"
You now have three levels instead of four. Note that this also removed the trailing spaces from the level labels.
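If typing out every padded label feels brittle, another option (a sketch, not the method described above) is to strip the whitespace first with trimws() and then rename the misspelled level. Assigning a label that already exists merges the two levels. The toy factor below is made up for illustration:

```r
# Toy factor standing in for power$City (made-up values)
city <- factor(c("Ackland ", "Auckland ", "Sydney", "Wellington"))

# Step 1: strip leading/trailing whitespace from every level label
levels(city) <- trimws(levels(city))

# Step 2: rename the typo; "Auckland" already exists, so the two levels merge
levels(city)[levels(city) == "Ackland"] <- "Auckland"

print(levels(city))
```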
Subset to include only the relevant cities:
# %in% keeps every row whose city is one of the two names
subpower <- power[power$City %in% c("Auckland", "Wellington"), ]
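One thing to watch when filtering on several cities: == compares element by element and recycles the shorter vector, so it can silently skip matching rows, whereas %in% tests membership. Here is a small sketch with made-up data; droplevels() is an optional extra step that discards the now-unused "Sydney" level:

```r
# Made-up data for illustration only
toy <- data.frame(City = factor(c("Wellington", "Auckland", "Sydney", "Auckland")),
                  Area = c(120, 95, 150, 80))

# == recycles c("Auckland", "Wellington") along the column, so each row is
# compared with only one of the two names and genuine matches can be missed
by_equals <- toy[which(toy$City == c("Auckland", "Wellington")), ]

# %in% checks every row against both names
by_in <- toy[toy$City %in% c("Auckland", "Wellington"), ]

# Optionally drop the factor levels that no longer occur
by_in$City <- droplevels(by_in$City)

print(nrow(by_equals))
print(nrow(by_in))
```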
You could also subset to exclude negative values, extreme values, etc.
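As a rough sketch of that kind of filtering, the helper below drops rows with negative or implausibly large values. The function name remove_outliers and the default cut-offs are hypothetical, chosen only for illustration; pick limits that make sense for your data.

```r
# Hypothetical helper: keep rows whose Area, P.Winter and P.Summer are
# non-negative and no larger than the given limits (the limits are assumptions).
# Rows with missing values in these columns are dropped as well.
remove_outliers <- function(df, max_area = 1000, max_power = 10000) {
  keep <- !is.na(df$Area) & df$Area >= 0 & df$Area <= max_area &
    !is.na(df$P.Winter) & df$P.Winter >= 0 & df$P.Winter <= max_power &
    !is.na(df$P.Summer) & df$P.Summer >= 0 & df$P.Summer <= max_power
  df[keep, , drop = FALSE]
}

# Made-up rows: one negative area, one huge area, one negative consumption
toy <- data.frame(Area     = c(120, -50, 99999, 90),
                  P.Winter = c(2000, 1800, 2100, -300),
                  P.Summer = c(1500, 1400, 1600, 1200))
cleaned <- remove_outliers(toy)
print(cleaned)
```

Fixed thresholds are the simplest choice; limits derived from quantiles of each column would be an alternative if you have no prior idea of plausible ranges.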
Note: My only real contribution here is catching the extra spaces. When tackling similar problems myself, I found Aaron's answer very helpful. Hope this helps!