Stacking columns with similar names in R

I have a CSV file whose awful format I cannot change (simplified here):

Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"

My desired output is a new CSV containing:

inc,label,one,two,three
1,"a",1,1.5,"5 Things"
2,"a",5,5.5,"10 Things"
3,"a",9,9.5,"15 Things"
1,"b",2,2.5,"10 Things"
2,"b",6,6.5,"20 Things"
3,"b",10,10.5,"30 Things"

Basically:

lowercase the headers

strip off header prefixes and preserve them by adding them to a new column

remove header repetitions in later rows

stack each column that shares the latter part of their names (e.g. a_One and b_One values should be merged into the same column).

During this process, preserve the Inc value from the original row (there may be more than one row like this in various places).

With caveats:

I don't know the column names ahead of time (many files, many different columns). These need to be parsed if they are to be used as logic for stripping the repetitious header rows.

There may or may not be more than one column with properties like Inc that need to be preserved when everything gets stacked. Generally, Inc represents any column that does not have a prefix like a_ or b_. I have a regex to strip out these prefixes already.

So far, I've accomplished this:

> wip_path <- 'C:/path/to/horrible.csv'
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE)
> rawwip
   V1    V2    V3        V4    V5    V6        V7
1 Inc a_One a_Two   a_Three b_One b_Two   b_Three
2   1     1   1.5  5 Things     2   2.5 10 Things
3   2     5   5.5 10 Things     6   6.5 20 Things
4 Inc a_One a_Two   a_Three b_One b_Two   b_Three
5   3     9   9.5 15 Things    10  10.5 30 Things

> skips <- which(rawwip$V1==rawwip[1,1])
> skips
[1] 1 4

> filwip <- rawwip[-skips,]
> filwip
  V1 V2  V3        V4 V5   V6        V7
2  1  1 1.5  5 Things  2  2.5 10 Things
3  2  5 5.5 10 Things  6  6.5 20 Things
5  3  9 9.5 15 Things 10 10.5 30 Things

> rawwip[1,]
   V1    V2    V3      V4    V5    V6      V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three

But when I try to apply tolower() to these strings, I get:

> tolower(rawwip[1,])
[1] "4" "4" "4" "4" "4" "4" "4"

And this is quite unexpected.

So my questions are:

1) How can I gain access to the header strings in rawwip[1,] so that I can reformat them with tolower() and other string-manipulating functions?

2) Once I've done that, what's the most effective way to stack the columns with shared names while preserving the inc value for each row?

Bear in mind, there will be well over a thousand repetitious columns that can be filtered down to perhaps 20 shared column names. I will not know the position of each stackable column ahead of time. This needs to be determined within the script.

2 Solutions

#1

You can use the base reshape() function. For example, with the input

dd<-read.csv(text='Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
inc,a_one,a_two,a_three,b_one,b_two,b_three
3,9,9.5,"15 Things",10,10.5,"30 Things"')

you can do

dx <- reshape(subset(dd, Inc!="inc"), 
    varying=Map(function(x) paste(c("a","b"), x, sep="_"), c("One","Two","Three")),
    v.names=c("One","Two","Three"),
    idvar="Inc",    
    timevar="label",
    times = c("a","b"),
    direction="long")
dx

to get

    Inc label One  Two     Three
1.a   1     a   1  1.5  5 Things
2.a   2     a   5  5.5 10 Things
3.a   3     a   9  9.5 15 Things
1.b   1     b   2  2.5 10 Things
2.b   2     b   6  6.5 20 Things
3.b   3     b  10 10.5 30 Things
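The call above hard-codes the prefixes and the shared names, while the question says they are not known ahead of time. A minimal sketch of deriving them from the header follows. It assumes that a prefix is everything before the first underscore, that every prefix carries the same set of suffixes, and that repeated header rows have already been dropped; `prefix_re`, `stack_cols`, `id_cols`, `prefixes` and `stubs` are illustrative names, not part of the original answer.

```r
# Sample data from the question, with the repeated header rows already removed
csv_text <- 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
3,9,9.5,"15 Things",10,10.5,"30 Things"'
dd <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Assumed naming rule: <prefix>_<shared name>; swap in your own regex here
prefix_re <- "^([^_]+)_(.+)$"

stack_cols <- grep(prefix_re, names(dd), value = TRUE)    # columns to stack
id_cols    <- setdiff(names(dd), stack_cols)              # columns to keep, e.g. Inc
prefixes   <- unique(sub(prefix_re, "\\1", stack_cols))   # "a", "b"
stubs      <- unique(sub(prefix_re, "\\2", stack_cols))   # "One", "Two", "Three"

# One list element per shared name, each holding that name's columns
# in the same order as `prefixes`
varying <- lapply(stubs, function(s) paste(prefixes, s, sep = "_"))

dx <- reshape(dd,
              varying   = varying,
              v.names   = stubs,
              idvar     = id_cols,
              timevar   = "label",
              times     = prefixes,
              direction = "long")

names(dx) <- tolower(names(dx))  # lowercase headers, as requested in the question
rownames(dx) <- NULL             # drop the "1.a"-style row names
```

If some prefixes lack some suffixes, the `paste()` step would name columns that do not exist, so this sketch would need a check before calling reshape().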

Because your input data is messy (embedded headers), every column is read in as a factor (or as character in R 4.0 and later, where stringsAsFactors defaults to FALSE). You could try to convert to proper data types with

dx[] <- lapply(lapply(dx, as.character), type.convert, as.is = TRUE)
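The factor conversion is also the likely reason tolower(rawwip[1,]) returned "4" in the question: when columns are factors, coercing a data frame row to character yields the internal level codes, and each header string happens to sort as the fourth level in its column. The sketch below shows two ways to get at the header strings instead. It assumes the question's sample data; `raw_fac`, `raw_chr`, `hdr` and `hdr2` are illustrative names.

```r
csv_text <- 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"'

# stringsAsFactors = TRUE reproduces the default of R versions before 4.0
raw_fac <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = TRUE)

# Option 1: convert each one-row factor column to its label, then lowercase
hdr <- tolower(vapply(raw_fac[1, ], as.character, character(1)))

# Option 2: read everything as character so no factors are created at all
raw_chr <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)
hdr2 <- tolower(unlist(raw_chr[1, ], use.names = FALSE))
```

Either vector can then be passed to other string functions such as sub() or gsub().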

#2

I would suggest a combination of read.mtable from my GitHub-only "SOfun" package and merged.stack from my "splitstackshape" package.

Here's the approach. I'm assuming your data is stored in a file called "somefile.txt" in your working directory.

The packages we need:

library(splitstackshape) # for merged.stack
library(SOfun)           # for read.mtable

First, grab a vector of the names. While we are at it, change the name structure from "a_one" to "one_a" -- it's a much more convenient format for both merged.stack and reshape.

theNames <- gsub("(.*)_(.*)", "\\2_\\1", 
                 tolower(scan(what = "", sep = ",", 
                              text = readLines("somefile.txt", n = 1))))

Second, use read.mtable to read the data in. We create the data chunks by identifying all the lines that start with letters. You can use a more specific regular expression if that doesn't match your actual data.

This will create a list of data.frames, so we use do.call(rbind, ...) to put it together in a single data.frame:

theData <- read.mtable("somefile.txt", "^[A-Za-z]", header = FALSE, sep = ",")

theData <- setNames(do.call(rbind, theData), theNames)

This is what the data now look like:

theData
#                                               inc one_a two_a   three_a one_b two_b   three_b
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.1   1     1   1.5  5 Things     2   2.5 10 Things
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.2   2     5   5.5 10 Things     6   6.5 20 Things
# inc,a_one,a_two,a_three,b_one,b_two,b_three     3     9   9.5 15 Things    10  10.5 30 Things
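If installing a GitHub-only package is not an option, a similar result can be sketched in base R. This is not what read.mtable does internally; it assumes that every repeated header line matches the first line when case is ignored. The lines are given inline here so the example runs on its own; with a real file, `csv_lines` would come from readLines("somefile.txt").

```r
# Inline stand-in for readLines("somefile.txt")
csv_lines <- c(
  'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three',
  '1,1,1.5,"5 Things",2,2.5,"10 Things"',
  '2,5,5.5,"10 Things",6,6.5,"20 Things"',
  'inc,a_one,a_two,a_three,b_one,b_two,b_three',
  '3,9,9.5,"15 Things",10,10.5,"30 Things"'
)

# Flag every line that repeats the header, ignoring case
is_header <- tolower(csv_lines) == tolower(csv_lines[1])

# Same name handling as above: lowercase, then "a_one" becomes "one_a"
theNames <- gsub("(.*)_(.*)", "\\2_\\1",
                 tolower(scan(text = csv_lines[1], what = "", sep = ",",
                              quiet = TRUE)))

# Parse only the data lines; column types are detected from the data alone
theData <- read.csv(text = csv_lines[!is_header], header = FALSE,
                    col.names = theNames, stringsAsFactors = FALSE)
```

Because the header lines never reach read.csv(), the numeric columns come back as numbers rather than strings, and the row names are plain integers.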

From here, you can use merged.stack from "splitstackshape"....

merged.stack(theData, var.stubs = c("one", "two", "three"), sep = "_")
#    inc .time_1 one  two     three
# 1:   1       a   1  1.5  5 Things
# 2:   1       b   2  2.5 10 Things
# 3:   2       a   5  5.5 10 Things
# 4:   2       b   6  6.5 20 Things
# 5:   3       a   9  9.5 15 Things
# 6:   3       b  10 10.5 30 Things

... or reshape from base R:

reshape(theData, direction = "long", idvar = "inc", 
        varying = 2:ncol(theData), sep = "_")
#     inc time one  two     three
# 1.a   1    a   1  1.5  5 Things
# 2.a   2    a   5  5.5 10 Things
# 3.a   3    a   9  9.5 15 Things
# 1.b   1    b   2  2.5 10 Things
# 2.b   2    b   6  6.5 20 Things
# 3.b   3    b  10 10.5 30 Things
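To finish with the CSV the question asks for, the time column can be named "label" directly through the timevar argument and the result written out without row names. A minimal sketch, assuming a data frame shaped like theData above; `long` and `out_path` are illustrative names, and a temporary file stands in for the real output path.

```r
# Stand-in for theData as printed above
theData <- data.frame(
  inc     = 1:3,
  one_a   = c(1, 5, 9),
  two_a   = c(1.5, 5.5, 9.5),
  three_a = c("5 Things", "10 Things", "15 Things"),
  one_b   = c(2, 6, 10),
  two_b   = c(2.5, 6.5, 10.5),
  three_b = c("10 Things", "20 Things", "30 Things"),
  stringsAsFactors = FALSE
)

# timevar = "label" names the prefix column as in the desired output
long <- reshape(theData, direction = "long", idvar = "inc",
                varying = 2:ncol(theData), sep = "_", timevar = "label")
rownames(long) <- NULL  # drop the "1.a"-style row names

# Write the stacked data; row.names = FALSE keeps an index column out of the file
out_path <- tempfile(fileext = ".csv")
write.csv(long, out_path, row.names = FALSE)
```

By default write.csv() quotes character columns and the header names; pass quote = FALSE if the consumer of the file expects bare values.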
