Extracting data from a simple XML file

Time: 2022-06-09 08:56:03

I have an XML file with the following contents:

<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>

I need a way to extract what is inside the <job ...> </job> tags, programming in this case. This should be done at the Linux command prompt, using grep/sed/awk.

9 answers

#1


53  

Do you really have to use only those tools? They're not designed for XML processing, and although it's possible to get something that works OK most of the time, it will fail on edge cases, like encoding, line breaks, etc.

I recommend xml_grep:

xml_grep 'job' jobs.xml --text_only

Which gives the output:

programming

On Ubuntu/Debian, xml_grep is in the xml-twig-tools package.

#2


12  

 grep '<job' file_name | cut -f2 -d">"|cut -f1 -d"<"

#3


9  

Please don't use line and regex based parsing on XML. It is a bad idea. You can have semantically identical XML with different formatting, and regex and line based parsing simply cannot cope with it.

Things like unary tags and variable line wrapping - these snippets 'say' the same thing:

<root>
  <sometag val1="fish" val2="carrot" val3="narf"></sometag>
</root>


<root>
  <sometag
      val1="fish"
      val2="carrot"
      val3="narf"></sometag>
</root>

<root
><sometag
val1="fish"
val2="carrot"
val3="narf"
></sometag></root>

<root><sometag val1="fish" val2="carrot" val3="narf"/></root>

Hopefully this makes it clear why writing a regex/line-based parser is difficult. Fortunately, you don't need to. Many scripting languages have at least one parser option, sometimes more.
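To make the point concrete, here is a minimal sketch in Python. Using Python and its standard-library xml.etree.ElementTree is my own choice for illustration, not something the question asked for, and the helper name attributes_of_sometag is hypothetical. It feeds the four snippets above to a real parser and checks that they all produce the same attributes for sometag, which a line-based tool would not see as equal.

```python
import xml.etree.ElementTree as ET

# The four differently formatted snippets from above, as strings.
variants = [
    '<root>\n  <sometag val1="fish" val2="carrot" val3="narf"></sometag>\n</root>',
    '<root>\n  <sometag\n      val1="fish"\n      val2="carrot"\n      val3="narf"></sometag>\n</root>',
    '<root\n><sometag\nval1="fish"\nval2="carrot"\nval3="narf"\n></sometag></root>',
    '<root><sometag val1="fish" val2="carrot" val3="narf"/></root>',
]


def attributes_of_sometag(xml_text):
    """Parse one snippet and return the attributes of its sometag child."""
    root = ET.fromstring(xml_text)
    return dict(root.find("sometag").attrib)


parsed = [attributes_of_sometag(v) for v in variants]
all_equal = all(p == parsed[0] for p in parsed)
print(all_equal)
```

A grep for a line containing both sometag and val3 would only hit the first and last variants; the parser does not care how the tag is wrapped.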

As a previous poster has alluded to, xml_grep is available. It is actually a tool built on the XML::Twig Perl library. What it does is use 'xpath expressions' to find something, and it differentiates between document structure, attributes and 'content'.

E.g.:

xml_grep 'job' jobs.xml --text_only

However, in the interest of making better answers, here are a couple of 'roll your own' examples based on your source data:

First way:

Use twig handlers that catch elements of a particular type and act on them. The advantage of doing it this way is that it parses the XML 'as you go', and lets you modify it in flight if you need to. This is particularly useful for discarding 'processed' XML when you're working with large files, using purge or flush:

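#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

# Print the text of every 'job' element when its handler fires.
XML::Twig->new(
    twig_handlers => {
        'job' => sub { print $_->text }
    }
    # Slurp the input (file argument or stdin) into one string, so that
    # parse() gets the whole document rather than a list of lines.
    )->parse( do { local $/; <> } );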

This uses <> to take input (piped in, or specified on the command line as ./myscript somefile.xml) and processes it: for each job element, it extracts and prints any associated text. (You might want print $_ -> text,"\n" to insert a linefeed.)
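For comparison, the same 'handle each element as it is parsed' idea can be sketched in Python with xml.etree.ElementTree.iterparse from the standard library. This is only an analogue of the twig handler above, not XML::Twig itself; the function name iter_job_texts and the sample document (a jobs wrapper with two job children) are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_job_texts(source):
    """Yield the text of each job element as soon as its end tag is seen."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Namespaced tags look like '{uri}job', so accept both spellings.
        if elem.tag == "job" or elem.tag.endswith("}job"):
            yield (elem.text or "").strip()
            # Drop the element's content once handled (a rough analogue of purge).
            elem.clear()


# Hypothetical sample input; any file-like object opened in binary mode works.
sample = io.BytesIO(
    b'<jobs xmlns="http://www.sample.com/">'
    b"<job>programming</job><job>designing</job>"
    b"</jobs>"
)
job_texts = list(iter_job_texts(sample))
print(job_texts)
```

Clearing each element after use keeps memory flat on large inputs, which is the same motivation as purge or flush in the Perl version.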

Because it's matching on 'job' elements, it'll also match on nested job elements:

<job>programming
    <job>anotherjob</job>
</job>

This will match twice, and print some of the output twice too, because the text of the outer job includes the text of the inner one. You can, however, match on /job instead if you prefer. Usefully, this lets you, for example, print and delete an element, or copy and paste one, modifying the XML structure.
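The double output is easy to reproduce with any tree parser. The following is a small Python sketch using the standard-library xml.etree.ElementTree, offered as an illustration of the effect rather than of XML::Twig's own behaviour: it collects the full text under every job element in the nested snippet above.

```python
import xml.etree.ElementTree as ET

nested = "<job>programming\n    <job>anotherjob</job>\n</job>"
root = ET.fromstring(nested)

# iter("job") visits the outer element first, then the nested one.
# itertext() gathers the text of an element and all of its descendants.
texts = ["".join(elem.itertext()) for elem in root.iter("job")]

for text in texts:
    print(repr(text))
```

The inner element's text shows up once on its own and once as part of the outer element, which is why restricting the match to the root (/job) avoids the repetition.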

Alternatively - parse first, and 'print' based on structure:

# Slurp the whole input into one string before parsing.
my $twig = XML::Twig->new( )->parse( do { local $/; <> } );
print $twig -> root -> text;

As job is your root element, all we need to do is print its text.

But we can be a bit more discerning, and look for job or /job and print that specifically instead:

# Slurp the whole input into one string before parsing.
my $twig = XML::Twig->new( )->parse( do { local $/; <> } );
print $twig -> findnodes('/job',0)->text;

You can use XML::Twig's pretty_print option to reformat your XML too:

XML::Twig->new( 'pretty_print' => 'indented_a' )->parse( do { local $/; <> } ) -> print;

There's a variety of output format options, but for simpler XML (like yours) most will look pretty similar.

#4


7  

Just use awk; no other external tools are needed. The command below works even if your desired tags span multiple lines.

$ cat file
test
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">
programming</job>

$ awk -vRS="</job>" '{gsub(/.*<job.*>/,"");print}' file
programming

programming

#5


7  

Using xmlstarlet:

echo '<job xmlns="http://www.sample.com/">programming</job>' | \
   xmlstarlet sel -N var="http://www.sample.com/" -t -m "//var:job" -v '.'
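
The -N option binds the prefix var to the document's default namespace, which is why the path is //var:job rather than //job. As a rough cross-check in another tool, here is a Python sketch using the standard-library xml.etree.ElementTree; it is a swapped-in illustration, not part of xmlstarlet. ElementTree spells a namespaced tag as {uri}name, and the constant name NS is my own.

```python
import xml.etree.ElementTree as ET

# The document from the question, as bytes so the encoding declaration is honoured.
doc = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<job xmlns="http://www.sample.com/">programming</job>'
)

NS = "http://www.sample.com/"
root = ET.fromstring(doc)

# iter() includes the root itself, so this also works when job is the root.
jobs = [elem.text for elem in root.iter(f"{{{NS}}}job")]
print(jobs)
```

Searching for a plain job tag without the namespace finds nothing here, which is the same reason the xmlstarlet command needs -N.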

#6


4  

Assuming the opening and closing tags are on the same line, with input from stdin:

sed -ne '/<\/job>/ { s/<[^>]*>\(.*\)<\/job>/\1/; p }'

Notes: -n stops it outputting everything automatically; -e means the script is given inline as a one-liner (as opposed to a script file); /<\/job>/ acts like a grep; s strips the opening tag with its attributes and the end tag; ; starts a new statement; p prints; {} makes the grep apply to both statements, as one.

#7


2  

Using sed command:

Example:

$ cat file.xml
<note>
        <to>Tove</to>
                <from>Jani</from>
                <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
</note>

$ cat file.xml | sed -ne '/<heading>/s#\s*<[^>]*>\s*##gp'
Reminder

Explanation:

cat file.xml | sed -ne '/<pattern_to_find>/s#\s*<[^>]*>\s*##gp'

n - suppress automatic printing of all lines
e - the script follows

/<pattern_to_find>/ - selects lines that contain the specified pattern, e.g. <heading>

Next is the substitution part s///p, which removes everything except the desired value; / is replaced with # for better readability:

s#\s*<[^>]*>\s*##gp
\s* - includes whitespace if present (same at the end)
<[^>]*> - matches one <xml_tag>; used as the non-greedy alternative, because <.*?> does not work in sed
g - substitutes every match, e.g. also the closing </xml_tag> tag
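
The same regular expression can be tried outside sed. This is a small Python sketch with the standard re module, added only to show what the pattern removes; the name TAG_WITH_PADDING is hypothetical, and like the sed command it is line-based, so it shares the limits other answers warn about.

```python
import re

# Same pattern as in the sed command: a tag plus any surrounding whitespace.
TAG_WITH_PADDING = re.compile(r"\s*<[^>]*>\s*")

lines = [
    "<note>",
    "        <to>Tove</to>",
    "                <from>Jani</from>",
    "                <heading>Reminder</heading>",
    "        <body>Don't forget me this weekend!</body>",
    "</note>",
]

# Keep only lines containing <heading> (the /<heading>/ address in sed),
# then delete every tag on that line (the s#...##g part).
values = [TAG_WITH_PADDING.sub("", line) for line in lines if "<heading>" in line]
print(values)
```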

#8


0  

How about:

cat a.xml | grep '<job' | cut -d '>' -f 2 | cut -d '<' -f 1

#9


0  

A bit late to the show.

xmlcutty cuts out nodes from XML:

$ cat file.xml
<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">designing</job>
<job xmlns="http://www.sample.com/">managing</job>
<job xmlns="http://www.sample.com/">teaching</job>

The path argument names the path to the element you want to cut out. In this case, since we are not interested in the tags at all, we rename the tag to \n, so we get a nice list:

$ xmlcutty -path /job -rename '\n' file.xml
programming
designing
managing
teaching

Note that the XML was not well-formed to begin with (no single root element). xmlcutty can work with slightly broken XML, too.
