I have a CSV file. It contains the output of some previous R operations, so it is filled with index markers (such as [1] and [[1]]). When it is read into R, it looks like this, for example:
V1
1 [1] 789
2 [[1]]
3 [1] "PNG" "D115" "DX06" "Slz"
4 [1] 787
5 [[1]]
6 [1] "D010" "HC"
7 [1] 949
8 [[1]]
9 [1] "HC" "DX06"
(I don't know why there is so much wasted space between the row number and the output data.)
I need the above data to appear as follows (without the [1], [[1]], or quotation marks, and with the data placed beside its corresponding number):
789 PNG,D115,DX06,Slz
787 D010,HC
949 HC,DX06
(Possibly the 789 and its corresponding data PNG,D115,DX06,Slz should be separated by a tab, and likewise for each row.)
How can I achieve this in R?
2 solutions
#1
We could create a grouping variable ('indx') and split the 'V1' column by that index, after removing the bracketed index at the beginning of each string as well as the quotes (") within it. Assuming that we need the first column to hold the numeric element and the second column the non-numeric part, we can use regex to replace the spaces with , (as shown in the expected result), and then rbind the list elements.
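# flag the row just before each "[[1]]" row, so that the cumulative sum
# puts each number and the two rows that follow it into one group
indx <- cumsum(c(grepl('\\[\\[', df1$V1)[-1], FALSE))
# strip the quotes and the leading index ("[1]", "[[1]]"), split by group,
# then build one row per group: the trimmed number and the comma-joined values
do.call(rbind, lapply(split(gsub('"|^.*\\]', '', df1$V1), indx),
    function(x) data.frame(ind = trimws(x[1]),
        val = gsub('\\s+', ',', trimws(x[-1][x[-1] != ''])))))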
# ind val
#1 789 PNG,D115,DX06,Slz
#2 787 D010,HC
#3 949 HC,DX06
data
df1 <- structure(list(V1 = c("[1] 789", "[[1]]",
"[1] \"PNG\" \"D115\" \"DX06\" \"Slz\"",
"[1] 787", "[[1]]", "[1] \"D010\" \"HC\"", "[1] 949",
"[[1]]", "[1] \"HC\" \"DX06\"")), .Names = "V1",
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9"))
#2
Honestly, a command-line fix using sed, perl, or egrep -o is less painful:
sed -e 's/.*\][ \t]*//' dirty.csv > clean.csv
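The sed command above only strips the leading index; the quotes and the pairing of each number with its data are left for a later step. A minimal sketch of one way to finish the job on the command line follows. It assumes every CSV row is exactly one R print line, as in the question. The file names sample_dirty.csv and clean.tsv and the helper name clean_r_output are made up for illustration.

```shell
# Hypothetical sample input mirroring the rows shown in the question
# (assumption: one R print line per row; the real file may differ).
cat > sample_dirty.csv <<'EOF'
[1] 789
[[1]]
[1] "PNG" "D115" "DX06" "Slz"
[1] 787
[[1]]
[1] "D010" "HC"
[1] 949
[[1]]
[1] "HC" "DX06"
EOF

# clean_r_output is a hypothetical helper name, not part of any tool.
clean_r_output() {
  # sed: drop everything up to the last "]", remove quotes, trim trailing blanks
  sed -e 's/.*\][[:space:]]*//' -e 's/"//g' -e 's/[[:space:]]*$//' "$1" |
    awk 'NF == 0 { next }              # skip the emptied [[1]] lines
         key == "" { key = $0; next }  # first non-empty line of a group: the number
         { gsub(/[ \t]+/, ","); printf "%s\t%s\n", key, $0; key = "" }'
}

clean_r_output sample_dirty.csv > clean.tsv
cat clean.tsv
```

On the sample input this should print three rows, each with the number, a tab, and the comma-joined values (for example 789, a tab, then PNG,D115,DX06,Slz). If the real file carries row numbers or extra CSV quoting, the sed expressions would need adjusting.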