How do I remove illegal characters from an XML file in HTTParty?

Date: 2022-05-27 22:23:29

I was trying to download an XML file that contains '&' symbols using the HTTParty gem, and I am getting this error:

"treeparser.rb:95:in `rescue in parse' <RuntimeError: Illegal character '&' 
 in raw string  "4860 BOOMM 10x20 MD&"> (MultiXml::ParseError)"

Here is my code:

class SAPOrders
  include HTTParty
  default_params :output => 'xml'
  format :xml
  base_uri '<webservice url>'
end

xml =  SAPOrders.get('/<nameOfFile.xml>').inspect

What am I missing?

1 Answer

#1


3  

If you are using HTTParty and it's trying to parse the incoming XML before you can get your hands on it, then you'll need to split that process into the get and the parse, so you can put code between the two.
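
A minimal sketch of that split, assuming you are free to fetch the body yourself. It uses Ruby's bundled Net::HTTP and REXML rather than HTTParty, and the names `fetch_raw_xml` and `parse_with_cleanup` are made up for illustration. The fetch helper is defined but never called here, so the demo runs offline against an inline string.

```ruby
require 'net/http'
require 'uri'
require 'rexml/document'

# Step 1 (hypothetical helper): fetch the body as a plain String.
# Nothing is parsed at this point.
def fetch_raw_xml(url)
  Net::HTTP.get(URI(url))
end

# Steps 2 and 3: hand the raw text to a caller-supplied block for repair,
# then parse whatever the block returns.
def parse_with_cleanup(raw_xml)
  cleaned = yield(raw_xml)
  REXML::Document.new(cleaned)
end

# Offline demo: an inline string stands in for fetch_raw_xml(...).
raw = '<a><b parm="4860 BOOMM 10x20 MD&">foobar</b></a>'
doc = parse_with_cleanup(raw) { |text| text.sub('MD&"', 'MD&amp;"') }

puts doc.root.elements['b'].text
```

The point is only the shape: the repair step sits in a block between the get and the parse, so it can be swapped without touching either.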

I use OpenURI and Nokogiri for just those reasons, but whether you use those two or their equivalents, you will have the opportunity to pre-process the XML before parsing it. A bare '&' is an illegal character in XML; it should be encoded as an entity or placed in a CDATA block. Unfortunately, in the wilds of the internet there are lots of malformed XML feeds and files.
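
One way to do that pre-processing in general, rather than for a single known value, is to escape every ampersand that does not already start an entity reference. This is a rough sketch in plain Ruby; `BARE_AMPERSAND` and `escape_bare_ampersands` are hypothetical names. It assumes the feed has no CDATA sections (a bare '&' is legal there and would be escaped by mistake) and it leaves anything shaped like `&name;` alone, even if that entity is undefined.

```ruby
# An ampersand NOT followed by a named (&amp;), decimal (&#38;)
# or hexadecimal (&#x26;) reference ending in a semicolon.
BARE_AMPERSAND = /&(?!(?:[A-Za-z_:][\w.:-]*|#\d+|#x\h+);)/

# Returns a copy of the text with every bare ampersand escaped.
def escape_bare_ampersands(xml)
  xml.gsub(BARE_AMPERSAND, '&amp;')
end

raw = '<b parm="4860 BOOMM 10x20 MD&">R&amp;D &#38; more</b>'
puts escape_bare_ampersands(raw)
# prints: <b parm="4860 BOOMM 10x20 MD&amp;">R&amp;D &#38; more</b>
```

Existing references are left untouched, so running the cleanup twice does not double-escape anything.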

What I like about Nokogiri for this task is that it keeps on chugging, at least as far as it can. After the document is parsed you can check whether there were errors, and you can tweak some of the parser settings to control what it will do or complain about:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<a>
  <b parm="4860 BOOMM 10x20 MD&">foobar</b>
</a>
EOT

puts doc.errors
puts doc.to_xml

Which will output:

xmlParseEntityRef: no name
<?xml version="1.0"?>
<a>
  <b parm="4860 BOOMM 10x20 MD">foobar</b>
</a>

Notice that Nokogiri reported the error and stripped the bare &, yet I was still able to get usable output. You have to decide whether you want parsing to halt with an error, using the STRICT option, or to continue; Nokogiri can do either, depending on your needs.

You can also massage the incoming XML before parsing it:

require 'nokogiri'

xml = <<EOT
<a>
  <b parm="4860 BOOMM 10x20 MD&">foobar</b>
</a>
EOT

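# Patch the one known-bad value; String#[]= replaces the first match.
xml['MD&'] = 'MD&amp;'

# STRICT makes Nokogiri raise on malformed XML instead of recovering.
doc = Nokogiri::XML(xml) do |config|
  config.strict
end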

puts doc.errors
puts doc.to_xml

Which now outputs:

<?xml version="1.0"?>
<a>
  <b parm="4860 BOOMM 10x20 MD&amp;">foobar</b>
</a>

I know this isn't a perfect answer, but from my experience dealing with a lot of RSS/Atom and XML/HTML parsing, sometimes we have to open the dirty-tricks bag and go with whatever works instead of what is elegant.

Another path to nirvana in HTTParty would be to subclass the parser. You should be able to get inside the flow of the XML to the parser and massage it there. From the docs:

# Intercept the parsing for all formats
class SimpleParser < HTTParty::Parser
  def parse
    perform_parsing
  end
end
