I want to ask what mistake I am making in this code. I am trying to find the first occurrence of an image tag or an object tag in a post, then return a piece of HTML if one matches. Currently I can get the image tag, but I get no results for the object tag.
I think I am making a mistake in my regex pattern or something similar. I hope the requirement is clear enough. Thanks.
My code here:
function get_first_image(){
global $post, $posts;
$first_img = '';
ob_start();
ob_end_clean();
$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $post->post_content, $matches) || preg_match_all('/<object[0-9 a-z_?*=\":\-\/\.#\,<>\\n\\r\\t]+<\/object>/smi', $post->post_content, $matches);
$first_img = $matches [1] [0];
if(empty($first_img)){ //Defines a default image
$mediaSearch = preg_match_all('/<object[0-9 a-z_?*=\":\-\/\.#\,<>\\n\\r\\t]+<\/object>/smi', $post->post_content, $matches2);
$first_media = $matches2 [1] [0];
$first_img = "/images/default.jpg";
}
if(!empty($first_img)){
$result = "<div class=\"alignleft\"><img src=\"$first_img\" style=\"max-width: 200px;\" /></div>";
}
if(!empty($first_media)){
$result = "<p>" . $first_media . "</p>";
}
return $result;
}
2 Solutions
#1
While regular expressions are good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of a document is so variable that it is hard to extract a tag accurately (by accurately I mean a 100% success rate with no false positives).
What I recommend is to use a DOM parser such as SimpleHTML, like this:
function get_first_image(){
global $post, $posts;
require_once('SimpleHTML.class.php');
$post_dom = str_get_dom($post->post_content);
$first_img = $post_dom->find('img', 0);
if($first_img !== null) {
$first_img->style = $first_img->style . ';max-width: 200px';
return '<div class="alignleft">' . $first_img->outertext . '</div>';
} else {
$first_obj = $post_dom->find('object', 0);
if($first_obj !== null) {
return '<p>' . $first_obj->outertext . '</p>';
}
}
return '<div class="alignleft"><img src="/images/default.jpg" style="max-width: 200px;" /></div>';
}
Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can add to the styles of your current image.
A regular expression could be devised to achieve the same goal, but it would be limited: it would force the style attribute to come after the src attribute, or the opposite. Overcoming that limitation would add more complexity to the regular expression.
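The same idea can be sketched without a third-party library. The version below is only an illustration, not the answer's code: it swaps SimpleHTML for PHP's bundled `DOMDocument` extension, and the function name `first_image_or_object` and its `$html` and `$default` parameters are hypothetical (it takes the markup as a string instead of reading the global `$post`). Non-ASCII content may need extra encoding handling that is left out here.

```php
<?php
// Hypothetical helper: works on a raw HTML string rather than the global $post.
function first_image_or_object(string $html, string $default = '/images/default.jpg'): string
{
    $dom = new DOMDocument();

    // Post content is rarely a valid full document, so silence parser warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns nodes in document order; item(0) is null when empty.
    $img = $dom->getElementsByTagName('img')->item(0);
    if ($img !== null) {
        // Append to any existing style; attribute order in the source does not matter.
        $style = rtrim($img->getAttribute('style'), '; ');
        $img->setAttribute('style', ($style === '' ? '' : $style . '; ') . 'max-width: 200px');
        return '<div class="alignleft">' . $dom->saveHTML($img) . '</div>';
    }

    $object = $dom->getElementsByTagName('object')->item(0);
    if ($object !== null) {
        return '<p>' . $dom->saveHTML($object) . '</p>';
    }

    // Neither tag found: fall back to the default image.
    return '<div class="alignleft"><img src="' . $default . '" style="max-width: 200px;" /></div>';
}
```

Inside a WordPress template this would presumably be called as `first_image_or_object($post->post_content)`.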
Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:
<\s*?img\s+?[^>]*?\s*?src\s*?=\s*?(["'])((\\?+.)*?)\1[^>]*?>
And then again, the above can fail if:
- The attribute or tag name is in capitals and the i modifier is not used.
- Quotes are not used around the src attribute.
- An attribute other than src uses the > character somewhere in its value.
- Some other reason I have not foreseen.
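The failure cases listed above can be checked with a small script. This is a sketch under a few assumptions: the pattern is wrapped in `~` delimiters (the answer gives it without any), a nowdoc is used so no extra PHP escaping is needed, and the sample tags are made up for illustration.

```php
<?php
// The pattern from the answer, wrapped in "~" delimiters (an assumption).
$pattern = <<<'RE'
~<\s*?img\s+?[^>]*?\s*?src\s*?=\s*?(["'])((\\?+.)*?)\1[^>]*?>~
RE;

// Made-up sample tags, one per failure case plus a well-formed baseline.
$samples = [
    'plain'          => '<img src="a.png" alt="x">',
    'uppercase'      => '<IMG SRC="a.png">',
    'unquoted src'   => '<img src=a.png>',
    '> in attribute' => '<img title="a > b" src="a.png">',
];

$results = [];
foreach ($samples as $label => $html) {
    // Group 2 holds the src value when the pattern matches; null marks a miss.
    $results[$label] = preg_match($pattern, $html, $m) === 1 ? $m[2] : null;
}

var_export($results);
```

Only the 'plain' sample yields a src value; the other three come back as null. Appending the `i` modifier to the pattern fixes the uppercase case but not the other two.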
So again, simply don't use regular expressions to parse a DOM document.
#2
Try this: (You need to define what you want to get in the matches array)
function get_first_image(){
global $post, $posts;
$first_img = '';
ob_start();
ob_end_clean();
// The capture group has to sit inside the / delimiters, otherwise "(" is taken as the delimiter.
$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $post->post_content, $matches) || preg_match_all('/(<object[0-9 a-z_?*=\":\-\/\.#\,<>\\n\\r\\t]+<\/object>)/smi', $post->post_content, $matches);
$first_img = $matches [1] [0];
if(empty($first_img)){ //Defines a default image
$mediaSearch = preg_match_all('/<object[0-9 a-z_?*=\":\-\/\.#\,<>\\n\\r\\t]+<\/object>/smi', $post->post_content, $matches2);
$first_media = $matches2 [1] [0];
$first_img = "/images/default.jpg";
}
if(!empty($first_img)){
$result = "<div class=\"alignleft\"><img src=\"$first_img\" style=\"max-width: 200px;\" /></div>";
}
if(!empty($first_media)){
$result = "<p>" . $first_media . "</p>";
}
return $result;
}
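To illustrate what "define what you want to get in the matches array" means, here is a minimal sketch. It deliberately uses a simplified object pattern (a lazy `.*?` with the `s` modifier) instead of the character class from the question, and the sample markup is made up. It shows that index 1 of the matches array only exists when the pattern contains a parenthesized group.

```php
<?php
// Made-up sample markup containing an <object> element.
$html = '<p>intro</p><object data="movie.swf" type="application/x-shockwave-flash"></object>';

// No parentheses: only index 0 (the whole match) is filled in.
preg_match('~<object\b.*?</object>~is', $html, $without);

// With a capture group: index 1 holds the captured text.
preg_match('~(<object\b.*?</object>)~is', $html, $with);

var_export(isset($without[1]));        // false
echo "\n";
var_export($with[1] === $without[0]);  // true
```

The same rule applies to `preg_match_all`: reading `$matches[1][0]` only works when the pattern has a group, while `$matches[0][0]` always holds the first full match.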