I know a DOM parser is the better tool for this, but let's try to extract the text this way:
<?php
$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;
preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
if (empty($matches))
    exit;
$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];
$index_of_body_end_tag = strpos($html, '</body>');
$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
);
echo $body;
The result can be seen here: http://ideone.com/vH2FZ
As you can see, I am getting more text than expected.
There is something I don't understand. To get the correct length for substr($string, $start, $length), I am using:
$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
I don't see anything wrong with this formula.
Could somebody kindly suggest where the problem is?
Many thanks to you all.
EDIT:
Thank you very much to all of you. It was just a bug in my brain. After reading your answers, I now understand what the problem is. The length should be either:
$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));
Or:
$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
4 Answers
#1
11
Your string contains newlines, and by default . in a pattern does not match a newline character. Add the s modifier to make . match across lines.
Here is my solution; I prefer it this way.
<?php
$html=<<<EOD
<html>
<head>
</head>
<body buu="grger" ga="Gag">
<p>Some text</p>
</body>
</html>
EOD;
// Capture everything between <body ...> and </body>; the opening tag may carry any number of attributes.
if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
    $body = $matches[1];
}
// Output all matches for debugging purposes.
var_dump($matches);
?>
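To see what the s modifier actually changes, here is a minimal sketch. The sample string is made up for illustration, and its line breaks are written as explicit "\n" so the result does not depend on how the source file is saved.

```php
<?php
// Hypothetical sample input with explicit "\n" line breaks.
$html = "<body class=\"x\">\n<p>Some text</p>\n</body>";

// Without s, . stops at the first newline, so (.*?) can never reach </body>.
$withoutS = preg_match('/<body[^>]*>(.*?)<\/body>/i', $html, $m1);

// With s (PCRE_DOTALL), . also matches newlines and the capture spans lines.
$withS = preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $m2);

var_dump($withoutS);     // int(0)
var_dump($withS);        // int(1)
var_dump(trim($m2[1]));  // string(16) "<p>Some text</p>"
```

Note that [^>]* needs no modifier: a negated character class already matches newlines, so only the .* part depends on s.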
Edit: I am updating my answer to better explain why your code fails.
You have this string:
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
Everything seems fine with it, but each line actually ends with non-printable newline characters. You have 55 printable characters and 6 line breaks (each line break is actually 2 characters, \r\n, which is what the offsets below assume).
When you reach this part of the code:
$index_of_body_end_tag = strpos($html, '</body>');
You get the correct position of </body> (it starts at offset 51), but that offset counts the newline characters too.
So when you reach this line of code:
$index_of_body_start_tag + strlen($matched_body_start_tag)
It is evaluated to 31 (newlines included), and:
$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
It is evaluated to 51 - 25 + 6 = 32 (the number of characters to read), but there are only 16 printable characters between <body> and </body>, plus 4 non-printable ones (the line break after <body> and the line break before </body>). This is the problem: - and + have the same precedence and are evaluated left to right, so the tag length is added back instead of subtracted. You have to group the calculation like so:
$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
which evaluates to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).
:) Hope this helps you understand why grouping matters. (Sorry for misleading you about newlines; that point applies only to the regex example I gave above.)
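The arithmetic above can be checked with a short sketch. It assumes the line breaks are "\r\n" (two bytes each), which is what the offsets 25 and 51 imply; with plain "\n" the offsets would be smaller, but the grouping mistake would be the same.

```php
<?php
// Assumption: "\r\n" line breaks, matching the offsets quoted above.
$html = implode("\r\n", [
    '<html>', '<head>', '</head>', '<body>',
    '<p>Some text</p>', '</body>', '</html>',
]);

$start  = strpos($html, '<body>');   // 25
$tagLen = strlen('<body>');          // 6
$end    = strpos($html, '</body>');  // 51

$wrong = $end - $start + $tagLen;    // (51 - 25) + 6 = 32
$right = $end - ($start + $tagLen);  // 51 - 31 = 20

var_dump($wrong, $right);            // int(32) int(20)

// Reading 20 bytes from offset 31 yields exactly the body content.
var_dump(substr($html, $start + $tagLen, $right)); // "\r\n<p>Some text</p>\r\n"
```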
#2
4
Personally, I wouldn't use regex.
<?php
$html = <<<EOD
<html>
<head>
<title>Example</title>
</head>
<body>
<h1>foobar</h1>
</body>
</html>
EOD;
$s = strpos($html, '<body>') + strlen('<body>');
$f = '</body>';
echo trim(substr($html, $s, strpos($html, $f) - $s));
?>
returns <h1>foobar</h1>
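One caveat with the plain strpos approach: strpos returns false when the tag is missing, and a literal '<body>' will not match a tag that carries attributes. A possible hardened variant is sketched below. The function name extractBody is hypothetical (it is not part of the answer above), and the nullable return type assumes PHP 7.1 or later.

```php
<?php
/**
 * Hypothetical helper: returns the trimmed text between the first opening
 * <body ...> tag and the next </body>, or null if either tag is missing.
 */
function extractBody(string $html): ?string
{
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }
    // Skip to the end of the opening tag so that attributes are tolerated.
    $openEnd = strpos($html, '>', $open);
    if ($openEnd === false) {
        return null;
    }
    $start = $openEnd + 1;
    // Look for the closing tag only after the opening one.
    $close = stripos($html, '</body>', $start);
    if ($close === false) {
        return null;
    }
    return trim(substr($html, $start, $close - $start));
}

var_dump(extractBody("<html><body class=\"a\">\n<h1>foobar</h1>\n</body></html>")); // string(15) "<h1>foobar</h1>"
var_dump(extractBody('<html></html>'));                                            // NULL
```

This is still string matching, not parsing: a > inside an attribute value or a commented-out body tag would fool it.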
#3
2
The problem is in the length you compute for substr. You should subtract all the way:
$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)
But you are doing:
+ strlen($matched_body_start_tag)
That said, it seems a little overkill, considering you can do it with preg_match alone. You just need to make sure you match across newlines, using the s modifier:
preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $matches);
echo $matches[1];
Outputs:
<p>Some text</p>
#4
1
Somebody has probably already found your error; I didn't read all the replies.
The algebra is wrong.
Btw, this is my first time seeing ideone.com; that's pretty cool.
$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
);
or:
$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)
);
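For completeness, here is one way the whole script could look with the fix applied. This is only a sketch: the variable names $contentStart and $contentEnd are mine, the sample string uses explicit "\n" line breaks so the offsets are predictable, and the array destructuring assumes PHP 7.1 or later.

```php
<?php
// Sample input with explicit "\n" line breaks (an assumption, so the
// offsets do not depend on the source file's line endings).
$html = "<html>\n<head>\n</head>\n<body>\n<p>Some text</p>\n</body>\n</html>";

if (!preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE)) {
    exit('No <body> tag found');
}

// With PREG_OFFSET_CAPTURE each match is [matched text, byte offset].
[$tag, $tagOffset] = $matches[0];
$contentStart = $tagOffset + strlen($tag);

// Search for the closing tag only after the opening one.
$contentEnd = strpos($html, '</body>', $contentStart);
if ($contentEnd === false) {
    exit('No </body> tag found');
}

// Length = end offset minus start offset; the start is computed once.
$body = substr($html, $contentStart, $contentEnd - $contentStart);
echo trim($body), "\n"; // <p>Some text</p>
```

Storing the start offset in its own variable means the length is a single subtraction, so there is no place left for the precedence slip.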