I have a huge line that is a response from a web service, and I need to get all the strings between <asunto> and </asunto>. The file looks like this:
Content-Type: application/xop+xml; charset=UTF-8; type="application/soap+xml";
Content-Transfer-Encoding: binary
Content-ID: <root.message@cxf.apache.org>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"><soap:Body><ns1:consultarComunicacionesResponse xmlns:ns1="http://ve.tecno.afip.gov.ar/domain/service/ws"><ns2:RespuestaPaginada xmlns:ns2="http://ve.tecno.afip.gov.ar/domain/service/ws" xmlns:ns3="http://core.tecno.afip.gov.ar/model/ws/types" xmlns:ns4="http://ve.tecno.afip.gov.ar/domain/service/ws/types"><pagina>1</pagina><totalPaginas>1</totalPaginas><itemsPorPagina>100</itemsPorPagina><totalItems>2</totalItems><ns4:items><ns4:ComunicacionSimplificada><idComunicacion>sdfgsfdgsfdgsd</idComunicacion><cuitDestinatario>sdfgsdfgsdfgsfdg</cuitDestinatario><fechaPublicacion>sdfgsdfg</fechaPublicacion><fechaVencimiento>sdfgsdfgsdfg</fechaVencimiento><sistemaPublicador>sdfgsdfgsfg</sistemaPublicador><sistemaPublicadorDesc>sdfgsdfggf</sistemaPublicadorDesc><estado>2</estado><estadoDesc>sdfgsdfgsgf</estadoDesc><asunto>EXAMPLEEEEEEEEEEEEEEEE1</asunto><prioridad>3</prioridad><tieneAdjunto>sdfgfdg</tieneAdjunto></ns4:ComunicacionSimplificada><ns4:ComunicacionSimplificada><idComunicacion>sdfgsdfgdfg</idComunicacion><cuitDestinatario>sdfgdfsg</cuitDestinatario><fechaPublicacion>sdfgsdfg</fechaPublicacion><fechaVencimiento>sdfgdsfg</fechaVencimiento><sistemaPublicador>sdfgsdfg</sistemaPublicador><sistemaPublicadorDesc>sdfgsdfgdsfggsdf</sistemaPublicadorDesc><estado>1</estado><estadoDesc>dsfgsdfgsgd</estadoDesc><asunto>EXAMPLEEEEEEEEEEEEEEEE2</asunto><prioridad>asdfdsf</prioridad><tieneAdjunto>asdfasdf</tieneAdjunto></ns4:ComunicacionSimplificada></ns4:items></ns2:RespuestaPaginada></ns1:consultarComunicacionesResponse></soap:Body></soap:Envelope>
I should get something like this:
EXAMPLEEEEEEEEEEEEEEEE1
EXAMPLEEEEEEEEEEEEEEEE2
There may be many occurrences, anywhere from zero to hundreds.
Thank you!!
6 solutions
#1
0
awk to the rescue!
$ awk -v RS='[<>]' '/\/asunto/{f=0;next} f; /asunto/{f=1}' file
EXAMPLEEEEEEEEEEEEEEEE1
EXAMPLEEEEEEEEEEEEEEEE2
UPDATE: based on the comments, if there is a chance that the tag name also appears elsewhere (for example inside a longer tag name), you can anchor the match on both sides of the open/close tag names:
$ awk -v RS='[<>]' '/^\/asunto$/{f=0;next} f; /^asunto$/{f=1}' file
EXAMPLEEEEEEEEEEEEEEEE1
EXAMPLEEEEEEEEEEEEEEEE2
Or, equivalently, check for an exact string match:
$ awk -v RS='[<>]' '$0=="/asunto"{f=0;next} f; $0=="asunto"{f=1}' file
EXAMPLEEEEEEEEEEEEEEEE1
EXAMPLEEEEEEEEEEEEEEEE2
Also note that not all awk variants support a multi-character RS such as the '[<>]' regex used here; POSIX leaves that behavior unspecified.
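To make the flag logic easier to follow, here is a sketch of the exact-match variant spelled out with comments and run against a shortened, made-up sample (the function name and sample string are illustrative, not from the original). It assumes an awk that accepts a regex RS, such as gawk or mawk.

```shell
# Hypothetical, shortened sample; the real response is one long line.
sample='<items><asunto>FIRST</asunto><prioridad>3</prioridad><asunto>SECOND</asunto></items>'

extract_asunto_flag() {
  # RS='[<>]' turns every tag name and every piece of text into its own record.
  awk -v RS='[<>]' '
    $0 == "/asunto" { f = 0; next }  # closing tag name: stop printing
    f                                # flag set: print the current record
    $0 == "asunto"  { f = 1 }        # opening tag name: print from the next record on
  '
}

printf '%s\n' "$sample" | extract_asunto_flag
```

The order of the three rules matters: the flag is only raised after the print rule has run, so the tag name itself is never printed.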
#2
1
With GNU awk for multi-char RS:
$ awk -v RS='</?asunto>' '!(NR%2)' file
EXAMPLEEEEEEEEEEEEEEEE1
EXAMPLEEEEEEEEEEEEEEEE2
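The `!(NR%2)` test works because splitting on either tag makes the records alternate: odd-numbered records are the text outside the element, even-numbered records are its contents. A small sketch on a made-up sample (function names are illustrative) that prints the record numbers to show this:

```shell
# Hypothetical, shortened sample.
sample='<x><asunto>FIRST</asunto><y>1</y><asunto>SECOND</asunto></x>'

# Print every record with its number to make the alternation visible.
show_records() {
  awk -v RS='</?asunto>' '{ printf "%d: %s\n", NR, $0 }'
}

# Same filter as the one-liner above: keep only even-numbered records.
extract_asunto_even() {
  awk -v RS='</?asunto>' 'NR % 2 == 0'
}

printf '%s' "$sample" | show_records
printf '%s' "$sample" | extract_asunto_even
```

This relies on the opening and closing tags being balanced; a stray or missing tag would shift the parity and the wrong records would be printed.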
#3
1
You can also use GNU grep:
grep -oP '(?<=<asunto>)((?!</asunto>).)+(?=</asunto>)' yourfile
This takes advantage of a lookbehind plus negative and positive lookaheads.
Here's a nice explanation of its internals.
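For readers who don't want to follow the link, the pattern can be read piece by piece. The sketch below wraps the same command in a function (the name and the sample string are made up) and annotates each part; it assumes a GNU grep built with PCRE support (-P).

```shell
# Hypothetical, shortened sample.
sample='<x><asunto>FIRST</asunto><y>1</y><asunto>SECOND</asunto></x>'

extract_asunto_grep() {
  # (?<=<asunto>)       lookbehind: the match must start right after <asunto>
  # ((?!</asunto>).)+   take one character at a time, but never start into </asunto>
  # (?=</asunto>)       lookahead: the match must end right before </asunto>
  # -o prints only the matched part, one match per output line.
  grep -oP '(?<=<asunto>)((?!</asunto>).)+(?=</asunto>)'
}

printf '%s\n' "$sample" | extract_asunto_grep
```

Because the lookarounds are zero-width, the tags themselves are checked but never included in the output.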
Performance
$ wc -l bigfile
100000 bigfile
$ time awk -v RS='</?asunto>' '!(NR%2)' bigfile >/dev/null
real 0m0.277s
user 0m0.254s
sys 0m0.022s
$ time grep -oP '(?<=<asunto>)((?!</asunto>).)+(?=</asunto>)' bigfile >/dev/null
real 0m4.318s
user 0m4.292s
sys 0m0.020s
$ time awk -v RS='[<>]' '/\/asunto/{f=0;next} f; /asunto/{f=1}' bigfile >/dev/null
real 0m7.088s
user 0m6.928s
sys 0m0.021s
@Ed's code is by far the fastest.
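Before comparing timings, it is worth checking that the candidates agree on the same input. The following is only a sketch: the test file is a made-up stand-in (N copies of a short line), not the bigfile used above, and the helper name is hypothetical. It assumes GNU grep with -P and an awk with regex RS.

```shell
# Hypothetical test file generator: N copies of a one-line, made-up sample.
make_testfile() {
  awk -v n="$1" 'BEGIN {
    line = "<x><asunto>FIRST</asunto><y>1</y><asunto>SECOND</asunto></x>"
    for (i = 0; i < n; i++) print line
  }'
}

testfile=$(mktemp)
make_testfile 1000 > "$testfile"

# Run two of the candidates and compare their output byte for byte.
awk -v RS='</?asunto>' '!(NR%2)' "$testfile" > "$testfile.awk"
grep -oP '(?<=<asunto>)((?!</asunto>).)+(?=</asunto>)' "$testfile" > "$testfile.grep"

if cmp -s "$testfile.awk" "$testfile.grep"; then result="match"; else result="differ"; fi
lines=$(wc -l < "$testfile.awk")
echo "outputs $result, $((lines)) lines from awk"

rm -f "$testfile" "$testfile.awk" "$testfile.grep"
```

Once the outputs match, each command can be prefixed with `time` as in the measurements above.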
#4
0
Using an XML parser (and awk to remove the header)
awk -v RS= 'NR>1' ws.out | xmlstarlet sel -t -v //asunto -n
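The awk stage deserves a note: with an empty RS, awk reads in paragraph mode, so this only strips the header if a blank line separates it from the XML body. That blank line is an assumption here (it is not visible in the sample as posted). A sketch of just that stage, with a made-up response and a hypothetical function name:

```shell
# Hypothetical response: MIME headers, a blank line, then the XML body.
response='Content-Type: application/xop+xml
Content-ID: <root.message@cxf.apache.org>

<x><asunto>FIRST</asunto></x>'

strip_mime_header() {
  # RS= (empty) switches awk to paragraph mode: blank lines separate records.
  # Record 1 is the header block, so NR>1 keeps only what follows it.
  awk -v RS= 'NR>1'
}

printf '%s\n' "$response" | strip_mime_header
```

The remaining XML can then be piped to xmlstarlet as in the command above.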
#5
0
This might work for you (GNU sed):
sed -nr '/<asunto>([^<]*)<\/asunto>/{s//\n\1\n/;s/[^\n]*\n//;P;D}' file
This replaces the first match with its captured text surrounded by newlines, strips everything before that text, prints the resulting first line (P), deletes it (D), and repeats on what remains. Lines not containing the required string are ignored.
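The same script can be laid out one command per line to see each step. This is a sketch on a made-up sample with a hypothetical function name; it keeps the GNU sed assumptions of the one-liner (-r, `\n` in the replacement and inside the bracket expression).

```shell
# Hypothetical, shortened sample.
sample='<x><asunto>FIRST</asunto><y>1</y><asunto>SECOND</asunto></x>'

# Steps inside the braces, in order:
#   s//\n\1\n/      reuse the address regex; put the captured text on its own line
#   s/[^\n]*\n//    drop everything before the captured text
#   P               print up to the first newline (the captured text)
#   D               delete that part and restart on the rest without reading input
extract_asunto_sed() {
  sed -nr '/<asunto>([^<]*)<\/asunto>/{
    s//\n\1\n/
    s/[^\n]*\n//
    P
    D
  }'
}

printf '%s\n' "$sample" | extract_asunto_sed
```

Since the capture group is `[^<]*`, an element whose content contains another tag is not matched by this approach.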
#6
0
As pointed out elsewhere, an XML-aware tool would in principle be safer, but the following GNU grep incantation may be useful if there is no nesting of the "asunto" tags, and it will work even if the string between <asunto> and </asunto> is empty or contains other tags:
grep -oP '(?<=<asunto>).*?(?=</asunto>)'
The key here is the non-greedy subexpression: .*?
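To see what the non-greedy quantifier buys, the sketch below runs the lazy pattern next to a greedy one on a made-up line where the first element contains another tag. The function names and the sample are illustrative, and the greedy variant is there only for contrast.

```shell
# Hypothetical sample: the first asunto element contains another tag.
sample='<asunto>a <b>bold</b> subject</asunto><asunto>plain</asunto>'

extract_asunto_lazy() {
  # .*? stops at the first </asunto> that follows.
  grep -oP '(?<=<asunto>).*?(?=</asunto>)'
}

extract_asunto_greedy() {
  # Greedy .* (for comparison only) runs on to the last </asunto> on the line.
  grep -oP '(?<=<asunto>).*(?=</asunto>)'
}

printf '%s\n' "$sample" | extract_asunto_lazy
printf '%s\n' "$sample" | extract_asunto_greedy
```

The lazy version prints two lines, one per element, while the greedy one prints a single line spanning both. One caveat, as far as I can tell from the grep documentation: -o only prints non-empty matches, so an empty <asunto></asunto> yields no output line rather than a blank one.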