What am I doing wrong with super basic REGEX code?

Suppose I have a CSV file (titled "Substance Related Deaths of Females in 2014") with data that looks like the sample below. Keep in mind that this is a CSV file, that this is just a small sample, and that the data is made up, so the numbers aren't real:

Substance Related Deaths
of Females
by country             
2014
Country                 pregnant status     alcohol    opiates    heroin
USA                     pregnant            1,230      4,844      893
                        not pregnant        23,440     12,773     2,005
CANADA                  pregnant            1,094      735        804
                        not pregnant        18,661     5,787      1,050
GERMANY                 pregnant            444        97         203
                        not pregnant        1,007      388        1,375
MEXICO                  pregnant            786        1,456      1,532
                        not pregnant        20,562     2,645      7,594

The original CSV file contains 30 rows (including stuff we don't want at the top and bottom) and 8 columns.

Now suppose I want to ONLY keep all the rows where each row starts with a country with capitalized letters (in other words, I only want the rows that list the country first, and only the "pregnant" data). Here's what I did:

df <- readLines("substancedeaths.csv")
linesTOkeep <- grep("^[A-Z]",df)
mydata <- df[linesTOkeep]
finaltable <- read.table(textConnection(mydata),sep=",")

The original data has 10 countries and 8 columns (the first column is "State", the rest are substances). The end goal is a data frame with 10 rows and 8 columns. But after running my code, I end up with only 8 rows and 8 columns: it omits the USA and CANADA rows, so the result looks like this:

GERMANY                 pregnant            444        97         203
MEXICO                  pregnant            786        1,456      1,532

And so forth. Germany is at the top but USA and CANADA should be. Any ideas what may be happening?

1 solution

#1

How about the following:

linesTOkeep <- grep("^[[:upper:]]{3}", df)

mydata <- df[linesTOkeep]

finaltable <- as.data.frame(do.call(rbind, strsplit(mydata, split=" {2,10}")), stringsAsFactors=FALSE)

names(finaltable) <- c("Country", "pregnant_status", "alcohol", "opiates", "heroin")

The third line does the heavy lifting: it splits each kept line on runs of spaces and binds the pieces into the rows of a data frame. You can look at the accepted answer in this post.
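To see the steps on something runnable, here is a minimal sketch that uses a few of the sample lines from the question instead of reading `substancedeaths.csv`. It assumes the real file is space-aligned like the sample, which may not hold for the actual data. It also makes two small changes that are my own assumptions, not part of the code above: it splits on `" {2,}"` so that gaps wider than ten spaces are consumed as a single separator, and it strips the thousands separators to get numeric columns.

```r
# Sample lines copied from the question (made-up numbers, space-aligned columns)
df <- c(
  "Substance Related Deaths",
  "of Females",
  "by country             ",
  "2014",
  "Country                 pregnant status     alcohol    opiates    heroin",
  "USA                     pregnant            1,230      4,844      893",
  "                        not pregnant        23,440     12,773     2,005",
  "CANADA                  pregnant            1,094      735        804",
  "                        not pregnant        18,661     5,787      1,050"
)

# The original pattern also picks up the title and header lines,
# because they start with a single capital letter too
grep("^[A-Z]", df, value = TRUE)

# Requiring three leading capitals keeps only the country rows
mydata <- df[grep("^[[:upper:]]{3}", df)]

# Split each row on runs of two or more spaces (assumption: no upper bound),
# then bind the pieces row-wise into a data frame
parts <- strsplit(trimws(mydata), split = " {2,}")
finaltable <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(finaltable) <- c("Country", "pregnant_status", "alcohol", "opiates", "heroin")

# Remove the thousands separators so the counts become numeric
num_cols <- c("alcohol", "opiates", "heroin")
finaltable[num_cols] <- lapply(finaltable[num_cols],
                               function(x) as.numeric(gsub(",", "", x)))

print(finaltable)
```

Every kept row has to split into the same number of pieces for `rbind` to line the columns up, so it is worth checking `lengths(parts)` on the real file before building the data frame.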
