I have a CSV file whose awful format I cannot change (simplified here):
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"
My desired output is a new CSV containing:
inc,label,one,two,three
1,"a",1,1.5,"5 Things"
2,"a",5,5.5,"10 Things"
3,"a",9,9.5,"15 Things"
1,"b",2,2.5,"10 Things"
2,"b",6,6.5,"20 Things"
3,"b",10,10.5,"30 Things"
Basically:
- lowercase the headers
- strip off the header prefixes and preserve them by adding them to a new column
- remove header repetitions in later rows
- stack each column that shares the latter part of its name (e.g. a_One and b_One values should be merged into the same column)
- during this process, preserve the Inc value from the original row (there may be more than one row like this in various places)
With caveats:
- I don't know the column names ahead of time (many files, many different columns). These need to be parsed if they are to be used as logic for stripping the repetitious header rows.
- There may or may not be more than one column like Inc that needs to be preserved when everything gets stacked. Generally, Inc represents any column that does not have a prefix like a_ or b_. I already have a regex to strip out these prefixes.
So far, I've accomplished this:
> wip_path <- 'C:/path/to/horrible.csv'
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE)
> rawwip
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
4 Inc a_One a_Two a_Three b_One b_Two b_Three
5 3 9 9.5 15 Things 10 10.5 30 Things
> skips <- which(rawwip$V1==rawwip[1,1])
> skips
[1] 1 4
> filwip <- rawwip[-skips,]
> filwip
V1 V2 V3 V4 V5 V6 V7
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
5 3 9 9.5 15 Things 10 10.5 30 Things
> rawwip[1,]
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
But then when I try to apply a tolower() to these strings, I get:
> tolower(rawwip[1,])
[1] "4" "4" "4" "4" "4" "4" "4"
And this is quite unexpected.
So my questions are:
1) How can I gain access to the header strings in rawwip[1,] so that I can reformat them with tolower() and other string-manipulating functions?
2) Once I've done that, what's the most effective way to stack the columns with shared names while preserving the inc value for each row?
Bear in mind, there will be well over a thousand repetitious columns that can be filtered down to perhaps 20 shared column names. I will not know the position of each stackable column ahead of time. This needs to be determined within the script.
2 Solutions
#1
You can use the base reshape() function. For example, with the input
dd<-read.csv(text='Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
inc,a_one,a_two,a_three,b_one,b_two,b_three
3,9,9.5,"15 Things",10,10.5,"30 Things"')
you can do
dx <- reshape(subset(dd, Inc!="inc"),
varying=Map(function(x) paste(c("a","b"), x, sep="_"), c("One","Two","Three")),
v.names=c("One","Two","Three"),
idvar="Inc",
timevar="label",
times = c("a","b"),
direction="long")
dx
to get
Inc label One Two Three
1.a 1 a 1 1.5 5 Things
2.a 2 a 5 5.5 10 Things
3.a 3 a 9 9.5 15 Things
1.b 1 b 2 2.5 10 Things
2.b 2 b 6 6.5 20 Things
3.b 3 b 10 10.5 30 Things
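The call above hard-codes the prefixes ("a", "b") and the stubs ("One", "Two", "Three"). Since the question says the column names are not known ahead of time, here is a rough sketch of deriving both from names(dd). It assumes that prefixes are lowercase letters followed by an underscore (prefix_re is a hypothetical stand-in for the asker's own regex), that every prefix occurs with every stub, and that the embedded header rows have already been dropped:

```r
# Clean sample input (embedded header rows already removed).
dd <- read.csv(text = 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
3,9,9.5,"15 Things",10,10.5,"30 Things"', stringsAsFactors = FALSE)

prefix_re <- "^[a-z]+_"   # assumed prefix pattern; swap in your own regex
nms     <- names(dd)
stacked <- grepl(prefix_re, nms)                       # columns to stack
stubs   <- unique(sub(prefix_re, "", nms[stacked]))    # shared name parts
labels  <- unique(sub("_.*$", "", nms[stacked]))       # the prefixes

# One vector of source columns per stub, ordered like `labels`.
varying <- lapply(stubs, function(s) paste(labels, s, sep = "_"))
# Fail early if some prefix/stub combination is missing from the file.
stopifnot(all(unlist(varying) %in% nms))

dx <- reshape(dd, direction = "long",
              varying = varying, v.names = stubs,
              idvar = nms[!stacked],    # every unprefixed column is kept as an id
              timevar = "label", times = labels)
```

If the headers should also be lowercased, names(dx) <- tolower(names(dx)) after the reshape is enough.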
Because your input data is messy (embedded headers), read.csv() cannot detect the numeric columns, so every column comes in as a factor (or as character in R 4.0 and later, where stringsAsFactors defaults to FALSE). You could try to convert to proper data types with
dx[] <- lapply(lapply(dx, as.character), type.convert, as.is = TRUE)
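The same factor issue is most likely what is behind the "4" "4" "4" result in the question: a single row of a data frame is a list of factors, and coercing that list to character gives the internal level codes rather than the labels. A small sketch with a cut-down version of the data, where stringsAsFactors = TRUE is set explicitly to mimic the older read.csv() default:

```r
# stringsAsFactors = TRUE mimics the read.csv() default before R 4.0.
raw <- read.csv(text = 'Inc,a_One\n1,1\n2,5\nInc,a_One\n3,9',
                header = FALSE, stringsAsFactors = TRUE)

# raw[1, ] is a one-row data frame, i.e. a list of factors. tolower()
# coerces the list as a whole, which exposes the level codes.
codes <- tolower(raw[1, ])

# Converting each column to character first recovers the header strings.
hdr <- tolower(vapply(raw[1, ], as.character, character(1)))
```

Reading the file with stringsAsFactors = FALSE, or taking the names from the first line via readLines(), avoids the problem altogether.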
#2
I would suggest a combination of read.mtable from my GitHub-only "SOfun" package and merged.stack from my "splitstackshape" package.
Here's the approach. I'm assuming your data is stored in a file called "somefile.txt" in your working directory.
The packages we need:
library(splitstackshape) # for merged.stack
library(SOfun) # for read.mtable
First, grab a vector of the names. While we are at it, change the name structure from "a_one" to "one_a": it's a much more convenient format for both merged.stack and reshape.
theNames <- gsub("(.*)_(.*)", "\\2_\\1",
tolower(scan(what = "", sep = ",",
text = readLines("somefile.txt", n = 1))))
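To see what that gsub() call does without needing the file, here is the same transformation applied to the header line from the question, passed in as a string:

```r
hdr <- "Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three"

# Split the header on commas and lowercase each piece.
nms <- tolower(scan(text = hdr, what = "", sep = ",", quiet = TRUE))

# Swap "prefix_stub" to "stub_prefix". Names without an underscore
# (such as "inc") do not match the pattern and are left unchanged.
swapped <- gsub("(.*)_(.*)", "\\2_\\1", nms)
print(swapped)
```

Note that the first (.*) is greedy, so a name with more than one underscore is split at the last underscore, not the first.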
Second, use read.mtable to read the data in. We create the data chunks by identifying all the lines that start with letters. You can use a more specific regular expression if that doesn't match your actual data.
This will create a list of data.frames, so we use do.call(rbind, ...) to put it together in a single data.frame:
theData <- read.mtable("somefile.txt", "^[A-Za-z]", header = FALSE, sep = ",")
theData <- setNames(do.call(rbind, theData), theNames)
This is what the data now look like:
theData
# inc one_a two_a three_a one_b two_b three_b
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.1 1 1 1.5 5 Things 2 2.5 10 Things
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.2 2 5 5.5 10 Things 6 6.5 20 Things
# inc,a_one,a_two,a_three,b_one,b_two,b_three 3 9 9.5 15 Things 10 10.5 30 Things
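If installing a GitHub-only package is not an option, roughly the same theData can be put together with base R alone. This is only a sketch: the lines are given inline so it runs as is (in practice they would come from readLines("somefile.txt")), and it relies on the same assumption as above, namely that header lines start with a letter and data lines do not:

```r
# Stand-in for: txt <- readLines("somefile.txt")
txt <- c('Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three',
         '1,1,1.5,"5 Things",2,2.5,"10 Things"',
         '2,5,5.5,"10 Things",6,6.5,"20 Things"',
         'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three',
         '3,9,9.5,"15 Things",10,10.5,"30 Things"')

# Header lines (the first one and every repeat) start with a letter.
is_header <- grepl("^[A-Za-z]", txt)

# Lowercase the first header and swap "a_one" to "one_a".
theNames <- gsub("(.*)_(.*)", "\\2_\\1",
                 tolower(scan(text = txt[1], what = "", sep = ",", quiet = TRUE)))

# Parse only the data lines; column types are detected normally because
# no header text is mixed in with the values.
theData <- read.csv(text = txt[!is_header], header = FALSE,
                    col.names = theNames, stringsAsFactors = FALSE)
```

The result has plain 1, 2, 3 row names instead of the header-derived ones shown above, but the columns are the same.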
From here, you can use merged.stack from "splitstackshape"...
merged.stack(theData, var.stubs = c("one", "two", "three"), sep = "_")
# inc .time_1 one two three
# 1: 1 a 1 1.5 5 Things
# 2: 1 b 2 2.5 10 Things
# 3: 2 a 5 5.5 10 Things
# 4: 2 b 6 6.5 20 Things
# 5: 3 a 9 9.5 15 Things
# 6: 3 b 10 10.5 30 Things
... or reshape from base R:
reshape(theData, direction = "long", idvar = "inc",
varying = 2:ncol(theData), sep = "_")
# inc time one two three
# 1.a 1 a 1 1.5 5 Things
# 2.a 2 a 5 5.5 10 Things
# 3.a 3 a 9 9.5 15 Things
# 1.b 1 b 2 2.5 10 Things
# 2.b 2 b 6 6.5 20 Things
# 3.b 3 b 10 10.5 30 Things
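The reshape call above takes inc as the only id column and everything from column 2 onward as stackable. The question mentions that there may be several unprefixed columns at unknown positions, so here is a hedged variation that picks them out by name instead. The site column is made up for illustration, and the sketch assumes that after the name swap only the stackable columns contain an underscore:

```r
# Toy data with two id columns; "site" is a hypothetical extra one.
theData <- data.frame(inc = 1:3, site = c("x", "y", "z"),
                      one_a = c(1, 5, 9), two_a = c(1.5, 5.5, 9.5),
                      one_b = c(2, 6, 10), two_b = c(2.5, 6.5, 10.5),
                      stringsAsFactors = FALSE)

# Columns with an underscore are stacked; the rest are carried along as ids.
is_stacked <- grepl("_", names(theData))

long <- reshape(theData, direction = "long",
                idvar = names(theData)[!is_stacked],
                varying = which(is_stacked), sep = "_",
                timevar = "label")
rownames(long) <- NULL   # drop the "1.x.a"-style row names
```

Each original row ends up once per label, with inc and site repeated alongside the stacked one and two columns.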