AWK / SED: extract the strings between tags in a HUGE line

I have a huge line that is the response from a web service, and I need to get all the strings between <asunto> and </asunto>. The file looks like this:

Content-Type: application/xop+xml; charset=UTF-8; type="application/soap+xml";
Content-Transfer-Encoding: binary
Content-ID: <root.message@cxf.apache.org>

<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"><soap:Body><ns1:consultarComunicacionesResponse xmlns:ns1="http://ve.tecno.afip.gov.ar/domain/service/ws"><ns2:RespuestaPaginada xmlns:ns2="http://ve.tecno.afip.gov.ar/domain/service/ws" xmlns:ns3="http://core.tecno.afip.gov.ar/model/ws/types" xmlns:ns4="http://ve.tecno.afip.gov.ar/domain/service/ws/types"><pagina>1</pagina><totalPaginas>1</totalPaginas><itemsPorPagina>100</itemsPorPagina><totalItems>2</totalItems><ns4:items><ns4:ComunicacionSimplificada><idComunicacion>sdfgsfdgsfdgsd</idComunicacion><cuitDestinatario>sdfgsdfgsdfgsfdg</cuitDestinatario><fechaPublicacion>sdfgsdfg</fechaPublicacion><fechaVencimiento>sdfgsdfgsdfg</fechaVencimiento><sistemaPublicador>sdfgsdfgsfg</sistemaPublicador><sistemaPublicadorDesc>sdfgsdfggf</sistemaPublicadorDesc><estado>2</estado><estadoDesc>sdfgsdfgsgf</estadoDesc><asunto>EXAMPLEEEEEEEEEEEEEEEE1</asunto><prioridad>3</prioridad><tieneAdjunto>sdfgfdg</tieneAdjunto></ns4:ComunicacionSimplificada><ns4:ComunicacionSimplificada><idComunicacion>sdfgsdfgdfg</idComunicacion><cuitDestinatario>sdfgdfsg</cuitDestinatario><fechaPublicacion>sdfgsdfg</fechaPublicacion><fechaVencimiento>sdfgdsfg</fechaVencimiento><sistemaPublicador>sdfgsdfg</sistemaPublicador><sistemaPublicadorDesc>sdfgsdfgdsfggsdf</sistemaPublicadorDesc><estado>1</estado><estadoDesc>dsfgsdfgsgd</estadoDesc><asunto>EXAMPLEEEEEEEEEEEEEEEE2</asunto><prioridad>asdfdsf</prioridad><tieneAdjunto>asdfasdf</tieneAdjunto></ns4:ComunicacionSimplificada></ns4:items></ns2:RespuestaPaginada></ns1:consultarComunicacionesResponse></soap:Body></soap:Envelope>

I should get something like this:

EXAMPLEEEEEEEEEEEEEEEE1    
EXAMPLEEEEEEEEEEEEEEEE2

The tag may be repeated many times, anywhere from 0 to hundreds of occurrences.

Thank you!!

6 Solutions

#1

awk to the rescue!

$ awk -v RS='[<>]' '/\/asunto/{f=0;next} f; /asunto/{f=1}' file

EXAMPLEEEEEEEEEEEEEEEE1
EXAMPLEEEEEEEEEEEEEEEE2

UPDATE: based on the comments, if there is a chance that the string asunto also appears elsewhere, you can anchor the match on the left and right of the open/close tag names:

$ awk -v RS='[<>]' '/^\/asunto$/{f=0;next} f; /^asunto$/{f=1}' file
EXAMPLEEEEEEEEEEEEEEEE1
EXAMPLEEEEEEEEEEEEEEEE2

Or, equivalently, check for an exact string match:

$ awk -v RS='[<>]' '$0=="/asunto"{f=0;next} f; $0=="asunto"{f=1}' file
EXAMPLEEEEEEEEEEEEEEEE1
EXAMPLEEEEEEEEEEEEEEEE2

Also note that not all awk variants support a multi-character RS (POSIX leaves the behavior unspecified when RS is longer than one character).
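
If your awk only honors a single-character RS, one way around it is to leave RS alone and walk each line with match() instead. This is only a sketch and not part of the answer above: the function name extract_asunto is made up for illustration, and it assumes the text between the tags contains no '<' (no nested tags).

```shell
# Hypothetical helper: print the text of every <asunto>...</asunto> pair,
# using only match()/substr(), which any POSIX awk provides.
extract_asunto() {
  awk '{
    s = $0
    # Find the next complete element in what is left of the line.
    while (match(s, /<asunto>[^<]*<\/asunto>/)) {
      # "<asunto>" is 8 characters and "</asunto>" is 9, 17 in total.
      print substr(s, RSTART + 8, RLENGTH - 17)
      # Continue searching after the element just printed.
      s = substr(s, RSTART + RLENGTH)
    }
  }' "$@"
}

# Made-up sample input; prints EXAMPLE1 and EXAMPLE2 on separate lines.
printf '%s\n' '<a><asunto>EXAMPLE1</asunto><b>z</b><asunto>EXAMPLE2</asunto></a>' | extract_asunto
```

Called with a file name (extract_asunto file) it reads the file, without arguments it reads stdin.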

#2

With GNU awk for multi-char RS:

$ awk -v RS='</?asunto>' '!(NR%2)' file
EXAMPLEEEEEEEEEEEEEEEE1
EXAMPLEEEEEEEEEEEEEEEE2
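
In case the !(NR%2) part looks like magic: every <asunto> or </asunto> ends a record, so the input is cut into text-before, content, text-between, content, and so on, and the tag contents always end up in the even-numbered records. A small sketch with made-up input (even_records is just an illustrative name, and it needs an awk that treats RS as a regex, such as GNU awk):

```shell
# Hypothetical wrapper around the one-liner above.
even_records() {
  # RS matches either the opening or the closing tag;
  # !(NR%2) is true only for records 2, 4, 6, ...
  awk -v RS='</?asunto>' '!(NR%2)' "$@"
}

# Records are: pre | A | mid | B | post, so records 2 and 4 (A and B) are printed.
printf '%s\n' 'pre<asunto>A</asunto>mid<asunto>B</asunto>post' | even_records
```

This relies on the tags being balanced and in order. A stray <asunto> or </asunto> would shift the even/odd pattern.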

#3

You can also use GNU grep.

grep -oP '(?<=<asunto>)((?!</asunto>).)+(?=</asunto>)' yourfile

This takes advantage of lookbehind plus negative and positive lookahead.

Here's a nice explanation of its internals.

Performance

$ wc -l bigfile 
100000 bigfile

$ time awk -v RS='</?asunto>' '!(NR%2)' bigfile >/dev/null

real  0m0.277s
user  0m0.254s
sys 0m0.022s


$ time grep -oP '(?<=<asunto>)((?!</asunto>).)+(?=</asunto>)' bigfile >/dev/null

real  0m4.318s
user  0m4.292s
sys 0m0.020s

$ time awk -v RS='[<>]' '/\/asunto/{f=0;next} f; /asunto/{f=1}' bigfile >/dev/null

real  0m7.088s
user  0m6.928s
sys 0m0.021s

@Ed's code achieves the best performance by far.

#4

Using an XML parser (and awk to remove the header)

awk -v RS= 'NR>1' ws.out | xmlstarlet sel  -t -v //asunto -n
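
For anyone wondering what the awk half does: an empty RS puts awk in paragraph mode, so records are separated by blank lines, and NR>1 drops the first paragraph, which here is the MIME header block. A small sketch of just that step, with made-up sample input (strip_headers is only an illustrative name):

```shell
# Hypothetical helper: drop everything up to the first blank line.
strip_headers() {
  # RS= (empty) selects paragraph mode: blank lines separate records.
  # NR>1 prints every record except the first one (the headers).
  awk -v RS= 'NR>1' "$@"
}

# Two header lines, a blank line, then the XML body; only the body is printed.
printf 'Content-Type: text/xml\nContent-ID: <x>\n\n<a><asunto>A</asunto></a>\n' | strip_headers
```

The output of this step is what gets piped into xmlstarlet in the command above.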

#5

This might work for you (GNU sed):

sed -nr '/<asunto>([^<]*)<\/asunto>/{s//\n\1\n/;s/[^\n]*\n//;P;D}' file

This rewrites the line so that the matched string sits on a line of its own at the front of the pattern space, prints that line, deletes it, and then repeats on whatever is left. Lines not containing the required string are ignored.
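
The same script spread over several lines with comments may be easier to follow. This is only a sketch of the one-liner above, not a different method. The function name extract_asunto_sed is made up, and it still assumes GNU sed (for -r and for \n in the pattern and replacement).

```shell
# Hypothetical wrapper: the one-liner from above, one command per line.
extract_asunto_sed() {
  sed -nr '
    /<asunto>([^<]*)<\/asunto>/ {
      # Empty pattern reuses the address regex: wrap the captured
      # text in newlines, giving "before\ntext\nafter".
      s//\n\1\n/
      # Drop everything up to and including the first newline.
      s/[^\n]*\n//
      # Print up to the first newline, i.e. the captured text.
      P
      # Delete what was printed and restart the script on the
      # remainder without reading a new input line.
      D
    }
  ' "$@"
}

# Made-up sample input; prints A and B on separate lines.
printf '%s\n' 'x<asunto>A</asunto>y<asunto>B</asunto>z' | extract_asunto_sed
```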

#6

As pointed out elsewhere, an XML-aware tool would in principle be safer, but the following GNU grep incantation may be useful if there is no nesting of the "asunto" tags, and will work even if the string between <asunto> and </asunto> is empty or contains other tags:

grep -oP '(?<=<asunto>).*?(?=</asunto>)'

The key here is the non-greedy subexpression: .*?
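
To see why the ? matters, it may help to run the greedy and non-greedy versions side by side on a line with two elements. A small sketch with made-up input (the function names are only for illustration, and grep -P needs a GNU grep built with PCRE support):

```shell
# Non-greedy: stop at the first </asunto> after each <asunto>.
nongreedy() { grep -oP '(?<=<asunto>).*?(?=</asunto>)' "$@"; }

# Greedy: run on to the last </asunto> on the line.
greedy() { grep -oP '(?<=<asunto>).*(?=</asunto>)' "$@"; }

line='<asunto>A</asunto><x>1</x><asunto>B</asunto>'

# Prints A and B on separate lines.
printf '%s\n' "$line" | nongreedy

# Prints a single line: A</asunto><x>1</x><asunto>B
printf '%s\n' "$line" | greedy
```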
