I'm a beginner with regexes and I'm trying to achieve something relatively simple:
I have a dataset arranged like this:
1,AAA,aaaa,BBB,bbbbbb ...
2,AAA,aaaaaaa,BBB,bbb ...
3,AAA,aaaaa,BBB,bb ...
I want to wrap the variable-length alphanumeric strings that follow AAA or BBB (both are constant) in curly brackets:
1,AAA,{aaaa},BBB,{bbbbbb} ...
2,AAA,{aaaaaaa},BBB,{bbb} ...
3,AAA,{aaaaa},BBB,{bb} ...
So I have tried with sed this way:
sed 's/(AAA|BBB)[[:punct:]].[[:alnum:]]/\1{&}/g' dataset.txt
However I got this result:
1,AAA,{AAA,aa}aa,BBB,{BBB,bb}bbbb, ...
2,AAA,{AAA,aa}aaaaa,BBB,{BBB,bb}b, ...
3,AAA,{AAA,aa}aaa,BBB,{BBB,bb} ...
Obviously, the & in the replacement part of the sed command stands for the entire matched pattern. However, I would like & to be only what comes after the matched pattern. What am I doing wrong?
I have also tried adding word boundaries after [^ ], to no avail. Am I trying too hard with sed? Should I use a language that supports lookbehind instead?
Thanks for any help!
3 Answers
#1
Try this:
sed 's/\(AAA\|BBB\),\([^,]*\)/\1,{\2}/g' dataset.txt
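The \| alternation inside a basic regular expression is a GNU extension, so the command above may not behave the same way in every sed. Below is a minimal sketch of the same substitution written with extended regular expressions, assuming your sed accepts -E (GNU and BSD sed both do). It reads the sample rows from the question on stdin rather than from dataset.txt, and the variable names are only illustrative:

```shell
# Sample rows from the question, fed on stdin instead of dataset.txt.
input='1,AAA,aaaa,BBB,bbbbbb
2,AAA,aaaaaaa,BBB,bbb
3,AAA,aaaaa,BBB,bb'

# Group 1 is the constant tag; group 2 is everything up to the next comma.
result=$(printf '%s\n' "$input" | sed -E 's/(AAA|BBB),([^,]*)/\1,{\2}/g')

printf '%s\n' "$result"
```

Because [^,]* stops at the next comma, this version does not care whether the field is strictly alphanumeric.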
#2
You can always have more than one capture group in your regex, to capture different parts. You can even move the [:punct:] part inside the first capture group. Note that sed has no non-capturing (?:...) groups, so the inner (AAA|BBB) counts as group 2 and the alphanumeric run is group 3:
sed -E 's/((AAA|BBB)[[:punct:]])([[:alnum:]]+)/\1{\3}/g' dataset.txt
I don't understand what the . between [:punct:] and [:alnum:] was doing, so I removed it. Because of it, as you might have noticed, the regex was matching the following pattern:
{AAA,aa}
{BBB,bb}
i.e., it was matching just 2 characters after AAA and BBB: one for . and one for [[:alnum:]].
To match all the alphanumeric characters between one comma and the next, you need to use a quantifier: [[:alnum:]]+
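To make the effect of the quantifier visible, here is a small sketch that wraps whatever each pattern matches in angle brackets. The sample line and the variable names are illustrative only, and -E is assumed to be supported by your sed:

```shell
line='1,AAA,aaaa,BBB,bbbbbb'

# Without a quantifier: "." and "[[:alnum:]]" consume one character each.
short=$(printf '%s\n' "$line" | sed -E 's/(AAA|BBB)[[:punct:]].[[:alnum:]]/<&>/g')

# With "+": the class keeps matching until the next non-alphanumeric character.
full=$(printf '%s\n' "$line" | sed -E 's/(AAA|BBB)[[:punct:]][[:alnum:]]+/<&>/g')

printf '%s\n%s\n' "$short" "$full"
```

The first line printed should show only two characters captured after each tag, the second the whole field.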
#3
The following sed command should work.
On Linux:
sed -i.bak -r 's/((AAA|BBB)[[:punct:]])([[:alnum:]]+)/\1{\3}/g' dataset.txt
Or on OS X:
sed -i.bak -E 's/((AAA|BBB)[[:punct:]])([[:alnum:]]+)/\1{\3}/g' dataset.txt
-i is the in-place option: it saves the changes in the input file itself. With the .bak suffix, sed keeps a copy of the original file under the same name plus .bak.
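If you want to try the in-place edit without risking your data, a minimal sketch along these lines may help. It works on a throwaway copy in a temporary directory (the tmpdir, edited and backup names are my own, not part of the answer) and assumes mktemp -d and sed -E are available:

```shell
# Work in a throwaway directory so the real dataset.txt is never touched.
tmpdir=$(mktemp -d)
printf '%s\n' '1,AAA,aaaa,BBB,bbbbbb' > "$tmpdir/dataset.txt"

# -i.bak rewrites dataset.txt and keeps the original as dataset.txt.bak.
sed -i.bak -E 's/((AAA|BBB)[[:punct:]])([[:alnum:]]+)/\1{\3}/g' "$tmpdir/dataset.txt"

edited=$(cat "$tmpdir/dataset.txt")
backup=$(cat "$tmpdir/dataset.txt.bak")
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"

rm -rf "$tmpdir"
```

Writing the suffix directly against -i, with no space, is the form that both GNU and BSD sed accept.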