I am using the stream editor sed to convert a large set of text files (400MB of data) into CSV format.
I am very close to finishing, but the outstanding problem is quotes within quotes, in data like this:
1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"
The desired output is:
1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"
I have searched around for help but have not gotten close to a solution. I have tried the following sed commands with these regex patterns:
sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt
These come from the questions below, but they do not seem to work with sed:
Related question for perl
Related question for SISS
The original files are *.txt, and I am trying to edit them in place with sed.
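sed's basic and extended regular expressions have no lookahead or lookbehind, which is likely why the two commands above do nothing useful. As a rough sketch only, the intent of the second pattern (drop any quote that has a non-comma character on both sides) can be approximated with capture groups and a loop. The function name `strip_inner_quotes` is made up for illustration, and the approach assumes an inner quote never sits directly next to a comma.

```shell
# Hypothetical helper: remove every double quote that has a non-comma
# character on both sides. Reads stdin, writes stdout.
strip_inner_quotes() {
  # ":a" marks the loop start; "ta" jumps back to it whenever the s command
  # replaced something, so adjacent quotes such as "" are handled as well.
  sed -e ':a' -e 's/\([^,]\)"\([^,]\)/\1\2/' -e 'ta'
}

# Third sample line from the question.
printf '%s\n' '3,word3,"description for "word3"","another text","more text and more"' | strip_inner_quotes
```

Quotes at the start or end of a field are kept because they touch a comma or a line boundary. A field such as "a "b", c" would break this assumption, so it is worth trying on a copy of the files before using `sed -i`.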
2 solutions
#1
2
Here's one way using GNU awk and the FPAT variable:
gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file
Results:
1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"
Explanation:
Using FPAT, a field is defined as either "anything that is not a comma," or "a double quote, anything that is not a double quote, and a closing double quote". Then on every line of input, loop through each field and if the field starts and ends with a double quote, remove all quotes from the field. Finally, add double quotes surrounding the field.
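To see how FPAT splits a line before any quotes are removed, a small sketch like the following may help. It needs GNU awk, and the helper name `show_fields` is invented for this example.

```shell
# Hypothetical helper: print each field gawk finds on its own line,
# wrapped in [ ] so the field boundaries are visible. Requires GNU awk.
show_fields() {
  gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
        { for (i = 1; i <= NF; i++) printf "%d: [%s]\n", i, $i }'
}

# Example call (the second sample line from the question):
#   printf '%s\n' "$line" | show_fields
```

For the second sample line this should list five fields, with the last quoted text kept as a single field despite the commas inside it, because the quoted alternative gives the longer match there.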
#2
1
sed -e ':r s:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":; tr' FILE
This looks for strings of the form "STR1 "STR2" STR3 " and converts them to "STR1 STR2 STR3". Whenever a substitution is made, it repeats, to make sure that all nested strings at a depth > 2 are eliminated.
It also ensures that none of the STRx parts contains a comma.
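The loop can be tried on a single line before touching the files. The sketch below uses the same expression, split across several `-e` options so that the label and the branch are separate commands; the wrapper name `flatten_nested` is made up for illustration.

```shell
# Hypothetical wrapper around the expression above. Reads stdin, writes stdout.
flatten_nested() {
  # ":r" is the label, "tr" branches back to it after a successful substitution.
  sed -e ':r' \
      -e 's:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":' \
      -e 'tr'
}

# Third sample line from the question.
printf '%s\n' '3,word3,"description for "word3"","another text","more text and more"' | flatten_nested
```

The pattern consumes quotes four at a time. The last field of the first sample line contains five quotes, so after one pass three remain and the pattern no longer matches; that line appears to keep one stray inner quote and may need extra handling.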