I have a data frame which was created by importing several .csv
files and subsequently merging them all together.
Each of the data frames that I read in has its column headings on row 8, with some descriptive text in the first seven rows.
This is why the duplicate rows have occurred: I don't know how to take the values in row 8 of the first data frame as column names and then discard the first 8 rows from the rest of the data frames (though I'm sure it's possible).
Ultimately, what I want to happen is this:
- Read first .csv into data frame.
- Take values of row 8 to be column names
- Delete the first 8 rows.
- Read all other .csv files in, remove the first 8 rows from each one, and merge them all into the same data frame.
I am now faced with a problem where some of the rows will contain the same values as their corresponding column names.
For example, the merged data frame now looks something like this:
---------------------------
| Name  | Age | MonthBorn |
---------------------------
| Bob   | 23  | September |
| Steve | 45  | June      |
| Name  | Age | MonthBorn | # Should be removed
| Sue   | 74  | January   |
| Name  | Age | MonthBorn | # Should be removed
| Tracy | 31  | February  |
---------------------------
The trouble is that the combined data frame is almost 340,000 rows long, so I can't check everything by hand. Also, I have a rough idea of where each of these rows might appear, but I can't be certain, as the positions can vary.
How can I either check to see if the value of a row/cell matches the corresponding column name or set up the import process as outlined (bulleted) above?
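For the second option, the bulleted import process could be sketched roughly as follows. This is a minimal sketch, not a tested solution for the real files: the temporary directory, the two sample files and the helper name `read_one` are made up for illustration, and it assumes every file has exactly seven descriptive lines followed by the headings on line 8.

```r
# Hypothetical setup: write two small files that mimic the described layout
# (seven lines of descriptive text, column headings on line 8).
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
preamble <- paste("descriptive text line", 1:7)
writeLines(c(preamble, "Name,Age,MonthBorn", "Bob,23,September", "Steve,45,June"),
           file.path(tmp_dir, "a.csv"))
writeLines(c(preamble, "Name,Age,MonthBorn", "Sue,74,January", "Tracy,31,February"),
           file.path(tmp_dir, "b.csv"))

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# skip = 7 discards the descriptive lines; header = TRUE then reads
# line 8 of each file as the column names instead of as data.
read_one <- function(path) {
  read.csv(path, skip = 7, header = TRUE, stringsAsFactors = FALSE)
}

# Read every file the same way and stack the results row-wise.
combined <- do.call(rbind, lapply(files, read_one))
combined
```

Because each file's heading row is consumed as a header, no heading rows end up in the data, and numeric columns such as Age keep their numeric type.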
3 Answers
#1
0
If your data frame looks approximately as follows:
Df <- data.frame(Name, Age, MonthBorn)
Then you could use an ifelse statement to test if "MonthBorn" shows up in a row.
Df$MonthBornTest <- ifelse(Df$MonthBorn == "MonthBorn", "True", "False")
Then you should be able to do this to remove the rows that contain True, effectively dropping the rows you don't want anymore.
Df <- Df[!(Df$MonthBornTest == "True"), ]
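A variation on the same idea, as a sketch only: instead of testing a single column through a helper column, every cell can be compared with its own column name and a row dropped only when all of them match. The sample data and the name `is_header` are made up for illustration, and the sketch assumes all columns were read as character (which is what happens when heading rows are mixed into the data).

```r
# Made-up sample in the shape described in the question
Df <- data.frame(
  Name      = c("Bob", "Steve", "Name", "Sue", "Name", "Tracy"),
  Age       = c("23", "45", "Age", "74", "Age", "31"),
  MonthBorn = c("September", "June", "MonthBorn", "January", "MonthBorn", "February"),
  stringsAsFactors = FALSE
)

# One logical vector per column (cell == column name), combined with &
# so that a row is flagged only if every cell equals its column name.
is_header <- Reduce(`&`, Map(function(col, nm) col == nm, Df, names(Df)))

# Drop the flagged rows and renumber the remaining ones
Df <- Df[!is_header, ]
rownames(Df) <- NULL

# The stray heading rows forced every column to character;
# re-infer the column types now that those rows are gone.
Df[] <- lapply(Df, type.convert, as.is = TRUE)
Df
```

Requiring a match in every column is a little safer than testing one column, since a genuine record whose MonthBorn value happened to be "MonthBorn" would not be dropped on that basis alone.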
#2
1
We can use functions from dplyr and tidyr to combine the contents of all columns into a single column. After that, filter out the rows whose combined value is the same as the combined column names. dt2 is the final output.
# Create example data
dt <- read.table(text = "Name Age MonthBorn
Bob 23 September
Steve 45 June
Bob 23 September
Name Age MonthBorn
Sue 74 January
Name Age MonthBorn
Tracy 31 February",
header = TRUE, stringsAsFactors = FALSE)
# Load package
library(dplyr)
library(tidyr)
# Process the data
dt2 <- dt %>%
unite(ColName, everything(), sep = ", ", remove = FALSE) %>%
filter(ColName != toString(colnames(dt))) %>%
select(-ColName)
dt2
Name Age MonthBorn
1 Bob 23 September
2 Steve 45 June
3 Bob 23 September
4 Sue 74 January
5 Tracy 31 February
#3
1
Your data
df <- structure(list(Name_ = c("Bob", "Steve", "Bob", "Name", "Sue",
"Name", "Tracy"), `_Age_` = c("23", "45", "23", "Age", "74",
"Age", "31"), `_MonthBorn` = c("September", "June", "September",
"MonthBorn", "January", "MonthBorn", "February")), .Names = c("Name_",
"_Age_", "_MonthBorn"), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
Solution
library(stringr)
df[!sapply(1:nrow(df), function(x) all(mapply(function(x,y) str_detect(x,y), colnames(df), df[x,]))),]
Output
Name_ _Age_ _MonthBorn
1: Bob 23 September
2: Steve 45 June
3: Bob 23 September
4: Sue 74 January
5: Tracy 31 February