I don't want to parse some tags in XML

Time: 2022-12-20 19:38:52

This is a sample of the XML that I am currently working with:

<smsq>
  <sms>
  <id>96</id>
  <to>03333560511</to>
  <msg>  danial says: hahaha <space> nothing.
  </msg>
  </sms>
</smsq>

Please note that the <msg> tag can contain other tags (which should not be parsed), so I had to write a DTD for it. The DTD was something like this:

<!DOCTYPE smsq [
  <!ELEMENT sms (mID,to,msg,type)>
  <!ELEMENT mID (#PCDATA)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT msg (CDATA)>
]>

But the problem is that the XML parser still descends into the <msg> tag and says that the <space> tag should be closed with a </space> tag. I just want to fetch the data as it is from the XML; I do not want to parse msg any further.

Please help me resolve the problem and tell me if this can be done with DTDs.

Thanks!

5 answers

#1


1  

Firstly, the sample XML is not really XML, as the "space" tag is not closed.

Secondly, it looks like the reason you don't want to parse the "space" tag is that it's not really XML, just text that looks like XML. That text should be either escaped/encoded or enclosed in a CDATA section.
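As a minimal sketch of the CDATA option, the snippet below wraps the message text in a CDATA section and reads it back with the JDK's built-in DOM parser. The class name `CdataExample` and the helper `readMsg` are made up for illustration; only the standard `javax.xml.parsers` and `org.w3c.dom` APIs are assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataExample {
    // Hypothetical helper: parses the XML and returns the text of the first <msg> element.
    static String readMsg(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("msg").item(0).getTextContent();
        } catch (Exception e) {
            // Covers malformed XML as well as a missing <msg> element.
            throw new IllegalArgumentException("Could not read <msg> from XML", e);
        }
    }

    public static void main(String[] args) {
        // Inside CDATA the parser treats "<space>" as plain characters, not as a tag.
        String xml = "<smsq><sms><id>96</id><to>03333560511</to>"
                + "<msg><![CDATA[danial says: hahaha <space> nothing.]]></msg>"
                + "</sms></smsq>";
        System.out.println(readMsg(xml));
    }
}
```

Note that this only works if whoever produces the XML adds the CDATA wrapper; a DTD cannot add it after the fact.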

Lastly, if what you want to parse really is XML and you only want to parse the first-level tags, I wouldn't bother with a real XML parser. I'd create my own ultra-simple parser. All it has to do is parse first-level nodes, which shouldn't be too hard.
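One way such an ultra-simple parser might look is sketched below. It is only an illustration under loud assumptions: the input is a single <sms> record, the field names `id`, `to` and `msg` are taken from the sample in the question, the message body does not itself contain look-alike <id> or <to> tags, and surrounding whitespace in each value can be trimmed. The class name `SimpleSmsParser` and the method `parseSms` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleSmsParser {
    // Matches <name>...</name> for the known field names only. The body is greedy,
    // so for <msg> everything up to the LAST </msg> is captured, inner tags included.
    private static final Pattern FIELD =
            Pattern.compile("<(id|to|msg)>(.*)</\\1>", Pattern.DOTALL);

    // Hypothetical helper: extracts the first-level fields of one <sms> record as raw text.
    static Map<String, String> parseSms(String sms) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(sms);
        while (m.find()) {
            fields.put(m.group(1), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String sms = "<sms>\n  <id>96</id>\n  <to>03333560511</to>\n"
                + "  <msg>  danial says: hahaha <space> nothing.\n  </msg>\n</sms>";
        System.out.println(parseSms(sms).get("msg"));
    }
}
```

Because nothing inside a field is interpreted, the unclosed <space> is returned as ordinary text. The trade-off is that this is not an XML parser: it ignores entities, attributes, comments and multiple records.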

Good luck!

#2


4  

You can't make a DTD that makes buggy XML magically not buggy. The XML is not well-formed, so it can never be valid, since well-formedness is a prerequisite of validity (validity isn't even important here, AFAICT). It's analogous to how the words in an English sentence all have to be English words before it can be a grammatically correct English sentence.

<space> is not closed. It should either be followed by a </space> inside the <msg>, or be replaced with <space/>, or, if by saying you don't want it to be parsed you mean you want the actual text "<space>" in there, be encoded as such (i.e. &lt;space&gt;).
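The three options above can be checked with a small well-formedness test. This is a sketch using the JDK's SAX parser without any DTD; the class name `WellFormedCheck` and the method `isWellFormed` are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Hypothetical helper: true if the parser accepts the input as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser rejected the document (for example, a mismatched end tag).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<msg>hahaha <space> nothing.</msg>"));         // unclosed tag
        System.out.println(isWellFormed("<msg>hahaha <space></space> nothing.</msg>")); // closed pair
        System.out.println(isWellFormed("<msg>hahaha <space/> nothing.</msg>"));        // empty element
        System.out.println(isWellFormed("<msg>hahaha &lt;space&gt; nothing.</msg>"));   // escaped text
    }
}
```

Only the first input is rejected; the other three are the fixes listed above.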

#3


3  

A DTD can't help you with this problem. A DTD is by no means required (though it is quite handy to have one).

The document you posted above is not a well-formed (let alone valid) XML document. Period. That's the way it is, and no reasonable XML parser will parse it for you without raising an error.

What you can do, though, is replace the < symbol with the &lt; XML entity.

#4


1  

All XML tags have to be closed, either like <tag></tag> or <tag />.

If you want the <space> tag to be parsed as the text value of a tag, and not as a child tag, use &lt; and &gt; instead of < and >:

&lt;space&gt;
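If you control the code that builds the XML, the escaping can be done before the message text is written into <msg>. The helper below is a minimal sketch for element text only (attribute values would also need quote escaping); the names `XmlEscape` and `escapeText` are hypothetical.

```java
class XmlEscape {
    // Hypothetical helper: escapes the characters that are special in XML element text.
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break; // handled per character, so no double escaping
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String msg = "danial says: hahaha <space> nothing.";
        System.out.println("<msg>" + escapeText(msg) + "</msg>");
    }
}
```

A parser reading the result hands back the original text, "<space>" included, as the value of <msg>.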

#5


0  

I would isolate the solution to your problem into a method and deal with it simply for now. After all, you may not have control over the correctness of the message content.

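private static String getMessage(String msg) {
    int start = msg.indexOf("<msg>");
    int end = msg.lastIndexOf("</msg>");
    // Returns null when no complete <msg>...</msg> pair exists (avoids an index exception).
    return (start < 0 || end < start) ? null : msg.substring(start + 5, end);
}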

You may enhance it later, as more use cases become available.

Edit: If someone puts a "msg" element in the content, it still works, because lastIndexOf matches the final </msg>.
