How to extract a multilingual string between XML tags

Time: 2022-01-26 07:10:44

I am trying to extract the text inside an XML element. The text is multilingual. For example:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

I searched for a solution and found a few regexes, but none of them worked. Here is one I tried:

String str = "<string xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">"+
    "तुम्हारा नाम क्या है"+"</string>";

final Pattern pattern = Pattern.compile("<String xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">(.+?)</string>");

final Matcher matcher = pattern.matcher(str);
matcher.find();
System.out.println(matcher.group(1));

The given string format is:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

and the expected output is:

तुम्हारा नाम क्या है

It's giving me an error

2 Solutions

#1


4  

This pattern matches the expected part, and $1 gives you the expected result:

/<string .*?>(.*?)<\/string>/
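
For reference, here is a minimal Java sketch of the same pattern. It rests on a few assumptions: the class name `RegexExtract` and the method `extract` are hypothetical, `Pattern.DOTALL` is added because the element body in the question spans several lines, and the result is trimmed. As posted, the question's pattern begins with `<String` (capital S) while the input begins with `<string`, so `find()` fails and `group(1)` throws an `IllegalStateException`; checking the result of `find()` avoids that.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexExtract {
    // Same pattern as above; DOTALL lets '.' match the line breaks inside the element.
    private static final Pattern STRING_ELEMENT =
        Pattern.compile("<string .*?>(.*?)</string>", Pattern.DOTALL);

    /** Returns the trimmed element text, or an empty Optional if nothing matched. */
    static Optional<String> extract(String xml) {
        Matcher matcher = STRING_ELEMENT.matcher(xml);
        // Call find() before group(1); group(1) throws if there was no match.
        return matcher.find() ? Optional.of(matcher.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String xml = "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(extract(xml).orElse("(no match)"));
    }
}
```

Because the pattern is case-sensitive and requires a space after `<string`, it only covers input shaped like the question's.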

However, I strongly recommend not doing this with a regex. Find an XML parser for Java and simply grab the content of the <string> tag.
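
One way to follow that advice is the DOM parser that ships with the JDK. The sketch below is only an illustration: the class name `DomExtract` and the method `rootText` are hypothetical, and it assumes, as in the question, that `<string>` is the root element of the input.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DomExtract {
    /** Parses the XML and returns the trimmed text content of the root element. */
    static String rootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        // Assumption: <string> is the document (root) element, as in the question.
        return document.getDocumentElement().getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(rootText(xml));
    }
}
```

Unlike the regex, the parser handles attribute order, whitespace inside the tag, and escaped characters such as `&amp;` without extra work.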

#2


0  

Don’t use regular expressions for parsing XML. It will work in a few cases, but eventually it will fail. See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for a full explanation.

The easiest way to extract an element’s string content is with XPath:

String contents =
    XPathFactory.newInstance().newXPath().evaluate(
        "//*[local-name()='string']",
        new InputSource(new StringReader(str)));
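
The snippet above needs a few imports and throws a checked exception, so a complete version might look like the following sketch. The class name `XPathExtract` and the method `stringContent` are hypothetical, and the `trim()` call is an addition to drop the surrounding whitespace shown in the question's sample.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XPathExtract {
    /** Returns the trimmed text of the first element whose local name is "string". */
    static String stringContent(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() matches the element regardless of its default namespace,
        // so no NamespaceContext has to be registered.
        String contents = xpath.evaluate(
            "//*[local-name()='string']",
            new InputSource(new StringReader(xml)));
        return contents.trim();
    }

    public static void main(String[] args) throws XPathExpressionException {
        String xml = "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(stringContent(xml));
    }
}
```

If no element matches, `evaluate` returns an empty string rather than throwing, so callers may want to check for that case.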
