I have this test file.
[root@localhost ~]# cat f.txt
"a aa" MM "bbb  b"
MM MM
MM"b b "
[root@localhost ~]#
I want to replace every space character inside the quotes, and only inside the quotes. Characters outside the quotes must not be touched. That is, I want something like:
"a_aa" MM "bbb__b"
MM MM
MM"b_b_"
Can this be implemented using sed?
Thanks,
4 solutions
#1
8
This is an entirely non-trivial question.
This replaces the first space inside each pair of quotes with an underscore:
$ sed 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' f.txt
"a_aa" MM "bbb_ b"
MM MM
MM"b_b "
$
For this example, where there are no more than two spaces inside any of the quotes, it is tempting to simply repeat the command, but it gives an incorrect result:
$ sed -e 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' \
> -e 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' f.txt
"a_aa"_ MM "bbb_ b"
MM MM
MM"b_b_"
$
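The stray underscore appears because nothing ties the match to an opening quote: once "a_aa" has no spaces left, its closing quote and the opening quote of the next string look like a quoted span of their own. A minimal reproduction, assuming a one-line input made up for illustration:

```shell
# Input as it looks after the first pass: no spaces remain inside "a_aa".
line='"a_aa" MM "b"'

# Same BRE as above. The closing quote of "a_aa" and the opening quote of "b"
# are taken as a pair, so the space between them is replaced.
result=$(printf '%s\n' "$line" | sed 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g')
printf '%s\n' "$result"
# prints "a_aa"_MM "b"
```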
If your version of sed supports 'extended regular expressions', then this works for the sample data:
$ sed -E \
> -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
> -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
> -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
> f.txt
"a_aa" MM "bbb__b"
MM MM
MM"b_b_"
$
You have to repeat that ghastly regex for every space within double quotes - hence three times for the first line of data.
The regex can be explained as:
- Starting at the beginning of a line,
- Look for sequences of 'zero or more non-quotes, optionally followed by a quote, zero or more characters that are neither spaces nor quotes, and a quote', the whole assembly repeated zero or more times,
- Followed by a quote, zero or more characters that are neither quotes nor spaces, a space, zero or more non-quotes, and a quote.
- Replace the matched material with the leading part, the material at the start of the current quoted passage, an underscore, and the trailing material of the current quoted passage.
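To make the pieces easier to see, the regex can be assembled from named parts. This is only a sketch: the variable names done_part, head and tail are made up here, and it assumes a sed that accepts -E.

```shell
# Hypothetical names for the three parts of the ERE described above.
done_part='(([^"]*("[^ "]*")?)*)'   # group 1: leading text plus quoted strings with no spaces
head='("[^ "]*)'                    # group 4: opening quote and the text before the space
tail='([^"]*")'                     # group 5: rest of the quoted string and its closing quote

# One substitution replaces exactly one space inside quotes.
first_pass=$(printf '%s\n' '"a aa" MM "bbb  b"' |
  sed -E "s/^${done_part}${head} ${tail}/\\1\\4_\\5/")
printf '%s\n' "$first_pass"
# prints "a_aa" MM "bbb  b"
```

Only the first quoted space changes, which is why the command has to be applied once per space.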
Because of the start anchor, this has to be repeated once per blank...but sed has a looping construct, so we can do it with:
$ sed -E -e ':redo
> s/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/
> t redo' f.txt
"a_aa" MM "bbb__b"
MM MM
MM"b_b_"
$
The :redo defines a label; the s/// command is as before; the t redo command jumps to the label if there was any substitution done since the last read of a line or jump to a label.
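For reuse, the loop can be wrapped in a small shell function. This is a sketch; the name underscore_quoted is made up, and it assumes a sed that accepts -E:

```shell
# underscore_quoted: replace spaces inside double quotes with underscores.
# Reads the files named as arguments, or standard input if there are none.
underscore_quoted() {
  sed -E -e ':redo
s/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/
t redo' "$@"
}

sample_output=$(printf '%s\n' '"a aa" MM "bbb  b"' 'MM    MM' 'MM"b b "' | underscore_quoted)
printf '%s\n' "$sample_output"
```

The spaces between the two unquoted MM words are left alone, since only quoted text can match.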
Given the discussion in the comments, there are a couple of points worth mentioning:
- The -E option applies to sed on MacOS X (tested 10.7.2). The corresponding option for the GNU version of sed is -r (or --regexp-extended). The -E option is consistent with grep -E (which also uses extended regular expressions). The 'classic Unix systems' do not support EREs with sed (Solaris 10, AIX 6, HP-UX 11).
- You can replace the ? I used (which is the only character that forces the use of an ERE instead of a BRE) with *, and then deal with the parentheses (which require backslashes in front of them in a BRE to make them into capturing parentheses), leaving the script:
sed -e ':redo
s/^\(\([^"]*\("[^ "]*"\)*\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g
t redo' f.txt
This produces the same output on the same input - I tried some slightly more complex patterns in the input:
"a aa" MM "bbb  b"
MM MM
MM"b b "
"c c""d d""e  e" X " f "" g "
"C C" "D D" "E  E" x " F " " G "
This gives the output:
"a_aa" MM "bbb__b"
MM MM
MM"b_b_"
"c_c""d_d""e__e" X "_f_""_g_"
"C_C" "D_D" "E__E" x "_F_" "_G_"
- Even with BRE notation, sed supports the \{0,1\} notation to specify 0 or 1 occurrences of the previous RE term, so the ? version can be translated to a BRE using:
sed -e ':redo
s/^\(\([^"]*\("[^ "]*"\)\{0,1\}\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g
t redo' f.txt
This produces the same output as the other alternatives.
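One way to check that claim is to run the ERE and BRE forms on the same line and compare. This is a sketch: the variable names are made up, the test line is taken from the extra input above, and it assumes a sed that accepts -E.

```shell
input='"c c""d d""e  e" X " f "" g "'

# ERE form, using ? for the optional quoted string.
ere_out=$(printf '%s\n' "$input" | sed -E -e ':redo
s/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/
t redo')

# BRE form, using \{0,1\} in place of ? and backslashed parentheses.
bre_out=$(printf '%s\n' "$input" | sed -e ':redo
s/^\(\([^"]*\("[^ "]*"\)\{0,1\}\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/
t redo')

if [ "$ere_out" = "$bre_out" ]; then
  printf 'same: %s\n' "$bre_out"
else
  printf 'differ:\n%s\n%s\n' "$ere_out" "$bre_out"
fi
```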
#2
0
A somewhat unusual answer, in XSLT 2.0:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output method="text"></xsl:output>
<xsl:template name="init">
<!-- Split the file into lines (&#10; is a newline). -->
<xsl:for-each select="tokenize(unparsed-text('f.txt'),'&#10;')">
<!-- Split each line on double quotes; even-numbered pieces were inside quotes. -->
<xsl:for-each select="tokenize(.,'&quot;')">
<xsl:value-of select="if (position() mod 2 = 0)
then concat('&quot;',translate(.,' ','_'),'&quot;') else ."></xsl:value-of>
</xsl:for-each>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
To test it, just get saxon.jar from SourceForge and use the following command line:
java -jar saxon9.jar -it:init regexp.xsl
The XSLT file contains the reference to f.txt, so the text file must be in the same directory as the XSLT file. That can easily be changed by passing a parameter to the stylesheet.
It works in one pass.
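The same split-on-quotes idea can also be sketched from the shell with awk, for readers without an XSLT processor. This is not part of the stylesheet above; the function name underscore_quoted_awk is made up for illustration.

```shell
# underscore_quoted_awk: split each line on double quotes and rewrite only
# the pieces that sit between an opening and a closing quote.
underscore_quoted_awk() {
  awk 'BEGIN { FS = OFS = "\"" }
       {
         # With " as the separator, even-numbered fields lie inside quotes.
         # i < NF skips a trailing field that has no closing quote.
         for (i = 2; i < NF; i += 2) gsub(/ /, "_", $i)
         print
       }' "$@"
}

awk_output=$(printf '%s\n' '"a aa" MM "bbb  b"' 'MM    MM' 'MM"b b "' | underscore_quoted_awk)
printf '%s\n' "$awk_output"
```

Like the stylesheet, this makes a single pass over each line and needs no loop.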
#3
0
This would be really easy if the quoted text was all on separate lines. So one approach is to split the text so you'll have that, do the easy transform, then rebuild the lines.
Splitting the text is easy, but we'll need to distinguish between newlines that were
1. already present in the file
2. added by us
To do that, we can end each line with a symbol indicating to which class it belongs. I'll just use 1 and 2, corresponding directly to the above. In sed, we have:
sed -e 's/$/1/' -e 's/"[^"]*"/2\n&2\n/g'
This produces:
2
"a aa"2
MM 2
"bbb b"2
1
MM MM1
MM2
"b b "2
1
That's easy to transform, just use
sed -e '/".*"/ s/ /_/g'
giving
2
"a_aa"2
MM 2
"bbb__b"2
1
MM MM1
MM2
"b_b_"2
1
Finally, we need to put it back together. This is actually pretty horrible in sed, but feasible using the hold space:
sed -e '/1$/ {s/1$//;H;s/.*//;x;s/\n//g}' -e '/2$/ {s/2$//;H;d}'
(This would be a lot clearer in, e.g., awk.)
Pipe those three steps together and you're done.
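Put together, the pipeline might look like the sketch below. It assumes GNU sed (for \n in the replacement text) and feeds the sample data in with printf rather than reading f.txt; the variable name pipeline_output is made up.

```shell
pipeline_output=$(
  printf '%s\n' '"a aa" MM "bbb  b"' 'MM    MM' 'MM"b b "' |
    # Step 1: mark original line ends with 1, and put each quoted string
    # on its own line, marking the added line ends with 2.
    sed -e 's/$/1/' -e 's/"[^"]*"/2\n&2\n/g' |
    # Step 2: replace spaces only on the lines that hold a quoted string.
    sed -e '/".*"/ s/ /_/g' |
    # Step 3: collect the pieces in the hold space and print a rebuilt line
    # whenever a piece ending in 1 arrives.
    sed -e '/1$/ {s/1$//;H;s/.*//;x;s/\n//g}' -e '/2$/ {s/2$//;H;d}'
)
printf '%s\n' "$pipeline_output"
```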
#4
0
These might work for you:
sed 's/^/\n/;:a;s/\(\n[^"]*"[^ "]*\) \([^"]*"\)\n*/\1_\2\n/;ta;s/\n//;ta;s/\n//' file
Explanation:
Prepend a \n to the start of the line; this will be used to bump along the substitutions. Replace a single space with a _ within the "'s and, whilst there, place a \n ready for the next round of substitutions. Having replaced all the spaces, delete the \n and repeat. When all substitutions have occurred, delete the \n delimiter.
or this:
sed -r ':a;s/"/\n/;s/"/\n/;:b;s/(\n[^\n ]*) ([^\n]*\n)/\1_\2/g;tb;s/\n/%%%/g;ta;s/%%%/"/g' file
Explanation:
Replace the first pair of "'s with \n's. Replace the first space between the newlines with _, and repeat. Replace the \n's with a unique delimiter (%%%), and repeat from the beginning. Tidy up at the end by replacing every %%% with ".
A third way:
sed 's/"[^"]*"/\n&\n/g;$!s/$/@@@/' file |
sed '/"/y/ /_/;1{h;d};H;${x;s/\n//g;s/@@@/\n/g;p};d'
Explanation:
Surround all quoted expressions ("...") with newlines (\n's). Insert an end-of-line delimiter @@@ on all but the last line. Pipe the result to the second sed command. Translate all spaces to _'s on lines with a " in them. Store every line in the hold space (HS). At end of file, swap to the HS, delete all \n's and replace the end-of-line delimiters with \n's.
lastly:
sed 's/\("[^"]*"\)/$(tr '"' ' '_'"'<<<'"'"'\1'"'"')/g;s/^/echo /' file | sh
or GNU sed:
sed 's/\("[^"]*"\)/$(tr '"' ' '_'"'<<<'"'"'\1'"'"')/g;s/^/echo /e' file
left for the reader to work out.
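As a hint, it helps to look at what the first sed writes before the shell runs it. A minimal sketch on one sample line (the variable name generated is made up):

```shell
# Print the command that sed generates instead of piping it to a shell.
generated=$(printf '%s\n' 'MM"b b "' |
  sed 's/\("[^"]*"\)/$(tr '"' ' '_'"'<<<'"'"'\1'"'"')/g;s/^/echo /')
printf '%s\n' "$generated"
# prints echo MM$(tr ' ' '_'<<<'"b b "')
```

Each quoted string is turned into a command substitution that runs tr on it, and the whole line is handed to echo. Two caveats: <<< is not POSIX sh syntax, so the receiving shell needs to support it (bash does), and because echo gets the rest of the line unquoted, runs of spaces outside the quotes are collapsed to one.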