PHP SimpleXML does not preserve line breaks in XML attributes

Time: 2022-10-27 07:24:45

I have to parse externally provided XML that has attributes with line breaks in them. Using SimpleXML, the line breaks seem to be lost. According to another question, line breaks in attribute values are valid XML (even though far less than ideal!).

Why are they lost? And how can I preserve them?

Here is a demo file script (note that when the line breaks are not in an attribute they are preserved).

PHP File with embedded XML

$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Rows>
    <data Title='Data Title' Remarks='First line of the row.
Followed by the second line.
Even a third!' />
    <data Title='Full Title' Remarks='None really'>First line of the row.
Followed by the second line.
Even a third!</data>
</Rows>
XML;

$xml = new SimpleXMLElement( $xml );
print '<pre>'; print_r($xml); print '</pre>';

Output from print_r

SimpleXMLElement Object
(
    [data] => Array
        (
            [0] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [Title] => Data Title
                            [Remarks] => First line of the row. Followed by the second line. Even a third!
                        )

                )

            [1] => First line of the row.
Followed by the second line.
Even a third!
        )

)

6 solutions

#1


4  

The entity for a new line is &#10;. I played with your code until I found something that did the trick. It's not very elegant, I warn you:

// First, remove any indentation:
$xml = str_replace("     ","", $xml);
$xml = str_replace("\t","", $xml);

// Next, unify all line endings into Unix LF:
$xml = str_replace("\r","\n", $xml);
$xml = str_replace("\n\n","\n", $xml);

// Next, replace all newlines with the character reference:
$xml = str_replace("\n","&#10;", $xml);

// Finally, replace any newline references between >< with a real newline:
$xml = str_replace(">&#10;<",">\n<", $xml);

The assumption, based on your example, is that any new lines that occur inside a node or attribute will have more text on the next line, not a < to open a new element.

This of course would fail if your next line had some text that was wrapped in a line-level element.

#2


12  

Using SimpleXML, the line breaks seem to be lost.

Yes, that is expected... in fact it is required of any conformant XML parser that newlines in attribute values represent simple spaces. See attribute value normalisation in the XML spec.

If there was supposed to be a real newline character in the attribute value, the XML should have included a &#10; character reference instead of a raw newline.
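
The difference is easy to check. Here is a minimal sketch; the attribute names Raw and Ref are made up for illustration and do not come from the question's data:

```php
<?php
// Raw holds a literal line break; Ref holds the character reference &#10;.
$xml = "<Rows><data Raw='one\ntwo' Ref='one&#10;two' /></Rows>";

$data = (new SimpleXMLElement($xml))->data[0];

// json_encode makes the whitespace visible.
echo json_encode((string) $data['Raw']), "\n"; // "one two"
echo json_encode((string) $data['Ref']), "\n"; // "one\ntwo"
```

The raw line break is normalised to a space before SimpleXML ever sees the value, while the character reference survives as a real newline.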

#3


1  

Assuming $xmlData is your XML string before it is sent to the parser, this should replace all newlines in attributes with the correct entity. I had the issue with XML coming from SQL Server.

$parts = explode("<", $xmlData); // split on <
array_shift($parts); // drop whatever precedes the first < (normally an empty string)
$newParts = array(); // collects the rewritten parts
foreach ($parts as $p)
{
    // $attr is everything up to the first >: the tag name plus its attributes
    list($attr, $other) = explode(">", $p, 2);
    // encode CRLF and bare LF line breaks (this also hits breaks between attributes)
    $attr = str_replace(array("\r\n", "\n"), "&#10;", $attr);
    $newParts[] = $attr.">".$other; // put the part back together
}
$xmlData = "<".implode("<", $newParts); // rejoin, restoring the < prefixes

Probably can be done more simply with a regex, but that's not a strong point for me.
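
For what it's worth, the regex route could look roughly like this. It is only a sketch, not the answerer's code: the helper name encodeAttributeNewlines is made up here, and it assumes that no attribute value contains a raw > and that comments or CDATA sections hold nothing that looks like a start tag.

```php
<?php
// Hypothetical helper: encode line breaks that sit inside quoted attribute values.
function encodeAttributeNewlines(string $xml): string
{
    $result = preg_replace_callback(
        '/<[A-Za-z_][^<>]*>/', // start tags and empty-element tags only
        function (array $tag): string {
            // Within one tag, rewrite only the quoted attribute values,
            // so line breaks between attributes stay untouched.
            return preg_replace_callback(
                '/"[^"]*"|\'[^\']*\'/',
                function (array $value): string {
                    return str_replace(array("\r\n", "\r", "\n"), '&#10;', $value[0]);
                },
                $tag[0]
            );
        },
        $xml
    );
    return $result ?? $xml; // fall back to the input if the regex engine fails
}

$xmlData = "<Rows>\r\n  <data Title='T'\n Remarks='a\r\nb\nc' />\n</Rows>";
$row = (new SimpleXMLElement(encodeAttributeNewlines($xmlData)))->data[0];
echo json_encode((string) $row['Remarks']), "\n"; // "a\nb\nc"
```

Unlike the explode version, this leaves a line break between two attributes alone, so a tag that is wrapped across several lines stays well-formed.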

#4


1  

Here is code to replace the new lines with the appropriate character reference in that particular XML fragment. Run this code prior to parsing.

$replaceFunction = function ($matches) {
    return str_replace("\n", "&#10;", $matches[0]);
};
$xml = preg_replace_callback(
    "/<data Title='[^']+' Remarks='[^']+'/i",
    $replaceFunction, $xml);

#5


0  

This is what worked for me:

First, get the xml as a string:

    $xml = file_get_contents($urlXml);

Then do the replacement:

    $xml = str_replace(".\xe2\x80\xa9<as:eol/>",".\n\n<as:eol/>",$xml);

The "." and "<as:eol/>" were there because I needed to add breaks in that case. The "\n" newlines can be replaced with whatever you like.

After replacing, just load the xml-string as a SimpleXMLElement object:

    $xmlo = new SimpleXMLElement( $xml );

Et Voilà

#6


0  

Well, this question is old, but like me, someone might come to this page eventually. I took a slightly different approach, which I think is the most elegant of those mentioned.

Inside the XML, put some unique word that you will use to mark a new line.

Change the XML to:

<data Title='Data Title' Remarks='First line of the row. \n
Followed by the second line. \n
Even a third!' />

Then, once you have the desired node's value from SimpleXML as a string (here $output), write something like this:

$findme = '\n'; // single quotes: the literal two-character marker, not a real newline
if (strpos($output, $findme) !== false)
{
    $output = str_replace($findme, "<br/>", $output);
}

It doesn't have to be '\n'; it can be any unique character or string.
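
Put together, the marker approach might look like the sketch below. It assumes you can get the XML producer to write the marker in the first place; the variable $marker and the trimming of surrounding spaces are additions made here for illustration.

```php
<?php
// Nowdoc (<<<'XML') keeps \n as two literal characters instead of a newline.
$xml = <<<'XML'
<Rows>
    <data Title='Data Title' Remarks='First line of the row. \n
Followed by the second line. \n
Even a third!' />
</Rows>
XML;

$rows = new SimpleXMLElement($xml);
$output = (string) $rows->data[0]['Remarks'];

$marker = '\n'; // single quotes: backslash + n, not a real newline

// The parser has already turned each raw line break into a space, so the
// marker may be surrounded by whitespace; swallow it while converting.
$output = preg_replace('/\s*' . preg_quote($marker, '/') . '\s*/', '<br/>', $output);

echo $output, "\n";
// First line of the row.<br/>Followed by the second line.<br/>Even a third!
```

Pick a marker that cannot occur in real data, otherwise legitimate text will be turned into line breaks.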
