Suppose I have a CSV file (titled "Substance Related Deaths of Females in 2014") whose contents look like the sample below. Keep in mind that it is a CSV file, that this is only a small sample, and that the data is made up, so the numbers aren't real:
Substance Related Deaths
of Females
by country
2014
Country pregnant status alcohol opiates heroin
USA pregnant 1,230 4,844 893
not pregnant 23,440 12,773 2,005
CANADA pregnant 1,094 735 804
not pregnant 18,661 5,787 1,050
GERMANY pregnant 444 97 203
not pregnant 1,007 388 1,375
MEXICO pregnant 786 1,456 1,532
not pregnant 20,562 2,645 7,594
The original CSV file contains 30 rows (including stuff we don't want at the top and bottom) and 8 columns.
Now suppose I want to keep ONLY the rows that start with a country name in capital letters (in other words, only the rows that list the country first, which are also the "pregnant" rows). Here's what I did:
df <- readLines("substancedeaths.csv")
linesTOkeep <- grep("^[A-Z]",df)
mydata <- df[linesTOkeep]
finaltable <- read.table(textConnection(mydata),sep=",")
The original data has 10 countries and 8 columns (the first column is "State", the rest are substances). The end goal is a data frame with 10 rows and 8 columns. But after running my code I end up with only 8 rows and 8 columns: the USA and CANADA rows are omitted, and the result looks like this:
GERMANY pregnant 444 97 203
MEXICO pregnant 786 1,456 1,532
And so forth. GERMANY is at the top, but USA and CANADA should come before it. Any ideas what may be happening?
1 Answer
How about the following:
linesTOkeep <- grep("^[[:upper:]]{3}", df)
mydata <- df[linesTOkeep]
finaltable <- as.data.frame(do.call(rbind, strsplit(mydata, split=" {2,10}")), stringsAsFactors=FALSE)
names(finaltable) <- c("Country", "pregnant_status", "alcohol", "opiates", "heroin")
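The third line does the heavy lifting: it splits each kept line on runs of 2 to 10 spaces and binds the resulting pieces together as rows. See the accepted answer in this post for more detail.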
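To make the steps concrete, here is a minimal self-contained sketch that runs the same approach on the sample from the question. It assumes the lines are separated by two or more spaces, as in the sample as displayed; the inline `df` vector stands in for `readLines("substancedeaths.csv")`, and the `num_cols` conversion at the end is an illustrative extra, not part of the answer above. If the real file is comma-separated, the `split` pattern would need to change accordingly.

```r
# Stand-in for readLines("substancedeaths.csv"): the made-up sample rows,
# with fields separated by two spaces (an assumption about the layout).
df <- c(
  "Substance Related Deaths",
  "of Females",
  "by country",
  "2014",
  "Country  pregnant status  alcohol  opiates  heroin",
  "USA  pregnant  1,230  4,844  893",
  "  not pregnant  23,440  12,773  2,005",
  "CANADA  pregnant  1,094  735  804",
  "  not pregnant  18,661  5,787  1,050",
  "GERMANY  pregnant  444  97  203",
  "  not pregnant  1,007  388  1,375",
  "MEXICO  pregnant  786  1,456  1,532",
  "  not pregnant  20,562  2,645  7,594"
)

# Keep only lines that begin with three uppercase letters. This skips the
# title lines and the "Country ..." header, which start with a single capital.
linesTOkeep <- grep("^[[:upper:]]{3}", df)
mydata <- df[linesTOkeep]

# Split each kept line on runs of 2 to 10 spaces, then stack the pieces as rows.
fields <- strsplit(mydata, split = " {2,10}")
finaltable <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(finaltable) <- c("Country", "pregnant_status", "alcohol", "opiates", "heroin")

# Illustrative extra: strip thousands separators and convert counts to numbers.
num_cols <- c("alcohol", "opiates", "heroin")
finaltable[num_cols] <- lapply(
  finaltable[num_cols],
  function(x) as.numeric(gsub(",", "", x))
)

print(finaltable)
```

Every field comes back from `strsplit` as a character string, which is why the count columns need the explicit conversion before they can be used in arithmetic.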