Remove all div tags from a string with a PHP regex?

Time: 2022-08-27 16:20:02

I have a WYSIWYG editor on a site. The problem is that users copy-paste a lot of data into it, leaving a lot of unclosed and improperly formatted div tags that break the site layout.

Is there an easy way to strip all occurrences of <div> and </div>?

str_replace won't work because some of the divs carry styling and other attributes, so it would need to account for <div style="some styling">, <div align="center">, etc.

I'm guessing this could be done with a regular expression, but I am a total beginner when it comes to those.

4 Answers

#1


-1  

No. You do NOT ever parse/manipulate HTML with regexes.

Regexes cannot be bargained with. They can't be reasoned with. They don't understand HTML, they don't grok XML. And they absolutely will NOT stop until your DOM tree is dead.

Use HTML Purifier and/or DOM to manipulate the tree.
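As a rough sketch of the DOM route (not code from this answer), the snippet below unwraps every div while keeping its children. The helper name `unwrap_divs` and the sample string are made up for illustration, and it assumes UTF-8 input plus the `dom` and `mbstring` extensions.

```php
<?php
// Hypothetical helper (not from the thread): remove every <div> element
// but keep whatever it contained.
function unwrap_divs(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    // Assumption: input is UTF-8. Non-ASCII characters become numeric
    // entities so the parser does not have to guess the encoding.
    $safe = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
    $doc->loadHTML('<body>' . $safe . '</body>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so snapshot it first.
    foreach (iterator_to_array($doc->getElementsByTagName('div')) as $div) {
        // Move each child in front of the div, then drop the empty div.
        while ($div->firstChild !== null) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Serialize only the children of <body>, not the wrapper itself.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

// Made-up input with unclosed divs, as a WYSIWYG paste might produce.
$dirty = '<p>Hi</p><div style="color:red"><span>A</span><div align="center">B';
echo unwrap_divs($dirty), "\n";
```

Because the parser closes unbalanced tags before the tree is walked, unclosed or stray div tags should not need special handling here. The parser may still normalize other parts of the markup, so check the result against real pasted content.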

#2


5  

It's better to use a DOM parser for HTML, but if you have no choice but to use a regex, you can do it like this:

$patterns = array();
$patterns[0] = '/<div\b[^>]*>/i'; // opening tags, with or without attributes
$patterns[1] = '/<\/div>/i';      // closing tags
$replacements = array();
$replacements[0] = '';
$replacements[1] = '';
echo preg_replace($patterns, $replacements, $html);
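The two patterns can also be folded into one. The following is a sketch only: the function name `strip_div_tags` and the sample input are made up, and it assumes no attribute value contains a literal `>`, which would cut the match short.

```php
<?php
// Hypothetical helper: strips opening and closing div tags in a single pass.
function strip_div_tags(string $html): string
{
    // </?     optional slash, so both <div ...> and </div> match
    // div\b   the tag name must end here, so <divider> is left alone
    // [^>]*   any attributes up to the closing bracket
    // i       case-insensitive, so <DIV> matches too
    return preg_replace('~</?div\b[^>]*>~i', '', $html) ?? $html;
}

echo strip_div_tags('<DIV style="a">x</div><divider>y</divider>'), "\n";
```

Using `~` as the delimiter avoids having to escape the slash in `</div>`.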

#3


0  

Here's a simplified example of how you could do it with PHP:

    <?php
    /**
     * Removes the divs because why not
     */
    function strip_divs(&$text, $id = 'html') {
      $replacements = array();
      worker($text, $replacements, $id);

      foreach ($replacements as $key => $val) {
        $text = mb_str_replace($key, $val, $text);
      }

      return $text;
    }

    function worker(&$body, &$replacements, $id) {
      static $call_count;
      if (empty($call_count)) {
        $call_count = array();
      }
      if (empty($call_count[$id])) {
        $call_count[$id] = 0;
      }

      if (mb_strpos($body, '</div>') !== FALSE) {
        $body = mb_str_replace('</div>', '', $body);
      }

      if (mb_strpos($body, '<di') !== FALSE) {
        $call_count[$id] ++;
        // Gets the important junk
        $rm               = '<di' . xml_get($body, '<di', '>') . '>';
        // Builds the replacements HTML
        $replacement_html = '';

        $next_id                       = count($replacements);
        $replacement_id                = "[[div-$next_id]]";
        $replacements[$replacement_id] = $replacement_html;

        $body = mb_str_replace($rm, $replacement_id, $body);

        if (mb_strpos($body, '<di') !== FALSE && $call_count[$id] < 200) {
          worker($body, $replacements, $id);
        }
      }
    }


    /**
     * Returns text by specifying a start and end point
     *
     * @param str $str
     *   The text to search
     * @param str $start
     *   The beginning identifier
     * @param str $end
     *   The ending identifier
     */
    function xml_get($str, $start, $end) {
      $str = "|" . $str . "|";
      $len = mb_strlen($start);
      if (mb_strpos($str, $start) > 0) {
        $int_start = mb_strpos($str, $start) + $len;
        $temp      = right($str, (mb_strlen($str) - $int_start));
        $int_end   = mb_strpos($temp, $end);
        $return    = trim(left($temp, $int_end));
        return $return;
      }
      else {
        return FALSE;
      }
    }

    function right($str, $count) {
      return mb_substr($str, ($count * -1));
    }

    function left($str, $count) {
      return mb_substr($str, 0, $count);
    }

    /**
     * Multibyte str replace
     */
    if (!function_exists('mb_str_replace')) {

      function mb_str_replace($search, $replace, $subject, &$count = 0) {
        if (!is_array($subject)) {
          $searches     = is_array($search) ? array_values($search) : array($search);
          $replacements = is_array($replace) ? array_values($replace) : array($replace);
          $replacements = array_pad($replacements, count($searches), '');
          foreach ($searches as $key => $search) {
            $parts   = mb_split(preg_quote($search), $subject);
            $count += count($parts) - 1;
            $subject = implode($replacements[$key], $parts);
          }
        }
        else {
          foreach ($subject as $key => $value) {
            $subject[$key] = mb_str_replace($search, $replace, $value, $count);
          }
        }
        return $subject;
      }

    }

    $html = <<<HTML
    <table>
        <tbody>
            <tr>
                <td class="votecell">
                    <div class="vote">
                        <input type="hidden" name="_id_" value="9607101">
                        <a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>
                        <span itemprop="upvoteCount" class="vote-count-post ">0</span>
                        <a class="vote-down-off" title="This question does not show any research effort; it is unclear or not useful">down vote</a>
                        <a class="star-off" href="#">favorite</a>
                        <div class="favoritecount"><b></b></div>
                    </div>
                </td>
                <td class="postcell">
                    <div>
                        <div class="post-text" itemprop="text">
                            <p>I have a wysiwyg on a site. The problem is that the users are copy pasting a lot of data in to it leaving a lot of unclosed and improperly formatted div tags that are breaking the site layout. </p>
                            <p>Is there an easy an easy way to strip all occurrences of <code>&lt;div&gt;</code> and <code>&lt;/div&gt;</code>?</p>
                            <p>str_replace won't work because some of the divs have styling and other things in them so it would need to account for <code>&lt;div style="some styling"&gt; &lt;div align="center"&gt;</code> etc</p>
                            <p>I'm guessing this could be done with a regular expression but I am total a total beginner when it comes to those. </p>
                            <p>Thanks a lot,
                                Martin
                            </p>
                        </div>
                        <div class="post-taglist">
                            <a href="/questions/tagged/php" class="post-tag js-gps-track" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/regex" class="post-tag js-gps-track" title="show questions tagged 'regex'" rel="tag">regex</a> <a href="/questions/tagged/replace" class="post-tag js-gps-track" title="show questions tagged 'replace'" rel="tag">replace</a> <a href="/questions/tagged/str-replace" class="post-tag js-gps-track" title="" rel="tag">str-replace</a> <a href="/questions/tagged/strip-tags" class="post-tag js-gps-track" title="show questions tagged 'strip-tags'" rel="tag">strip-tags</a>
                        </div>
                        <table class="fw">
                            <tbody>
                                <tr>
                                    <td class="vt">
                                        <div class="post-menu"><a href="/q/9607101" title="short permalink to this question" class="short-link" id="link-post-9607101">share</a><span class="lsep">|</span><a href="/posts/9607101/edit" class="suggest-edit-post" title="">improve this question</a></div>
                                    </td>
                                    <td align="right" class="post-signature">
                                        <div class="user-info ">
                                            <div class="user-action-time">
                                                <a href="/posts/9607101/revisions" title="show all edits to this post">edited <span title="2012-03-07 18:32:29Z" class="relativetime">Mar 7 '12 at 18:32</span></a>
                                            </div>
                                            <div class="user-gravatar32">
                                            </div>
                                            <div class="user-details">
                                                <div class="-flair">
                                                </div>
                                            </div>
                                        </div>
                                    </td>
                                    <td class="post-signature owner">
                                        <div class="user-info ">
                                            <div class="user-action-time">
                                                asked <span title="2012-03-07 18:31:11Z" class="relativetime">Mar 7 '12 at 18:31</span>
                                            </div>
                                            <div class="user-gravatar32">
                                                <a href="/users/702826/martin-hunt">
                                                    <div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/a578c3eae229c86dbe46d4b1603e071b?s=32&amp;d=identicon&amp;r=PG" alt="" width="32" height="32"></div>
                                                </a>
                                            </div>
                                            <div class="user-details">
                                                <a href="/users/702826/martin-hunt">Martin Hunt</a>
                                                <div class="-flair">
                                                    <span class="reputation-score" title="reputation score " dir="ltr">313</span><span title="7 silver badges"><span class="badge2"></span><span class="badgecount">7</span></span><span title="20 bronze badges"><span class="badge3"></span><span class="badgecount">20</span></span>
                                                </div>
                                            </div>
                                        </div>
                                    </td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
            <tr>
                <td class="votecell"></td>
                <td>
                    <div id="comments-9607101" class="comments ">
                        <table>
                            <tbody data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true">
                                <tr id="comment-12187969" class="comment ">
                                    <td class="comment-actions">
                                        <table>
                                            <tbody>
                                                <tr>
                                                    <td class=" comment-score">
                                                        <span title="number of 'useful comment' votes received" class="cool">1</span>
                                                    </td>
                                                    <td>
                                                        &nbsp;
                                                    </td>
                                                </tr>
                                            </tbody>
                                        </table>
                                    </td>
                                    <td class="comment-text">
                                        <div style="display: block;" class="comment-body">
                                            <span class="comment-copy">So you need to remove all the div tags but not the content between the div. Am I right?</span>
                                            –&nbsp;<a href="/users/500725/siva-charan" title="14,075 reputation" class="comment-user">Siva Charan</a>
                                            <span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12187969_9607101"><span title="2012-03-07 18:34:11Z" class="relativetime-clean">Mar 7 '12 at 18:34</span></a></span>
                                        </div>
                                    </td>
                                </tr>
                                <tr id="comment-12189778" class="comment ">
                                    <td>
                                        <table>
                                            <tbody>
                                                <tr>
                                                    <td class=" comment-score">
                                                        &nbsp;&nbsp;
                                                    </td>
                                                    <td>
                                                        &nbsp;
                                                    </td>
                                                </tr>
                                            </tbody>
                                        </table>
                                    </td>
                                    <td class="comment-text">
                                        <div style="display: block;" class="comment-body">
                                            <span class="comment-copy"><a href="http://*.com/a/4667535/208809">Replace the XPath with <code>//div[not[@*]]</code></a> to remove all div elements (incl. content) without attributes.</span>
                                            –&nbsp;<a href="/users/208809/gordon" title="225,421 reputation" class="comment-user">Gordon</a>
                                            <span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12189778_9607101"><span title="2012-03-07 19:58:21Z" class="relativetime-clean">Mar 7 '12 at 19:58</span></a></span>
                                            <span class="edited-yes" title="this comment was edited 2 times"></span>
                                        </div>
                                    </td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                    <div id="comments-link-9607101" data-rep="50" data-anon="true">
                        <a class="js-add-link comments-link disabled-link " title="Use comments to ask for more information or suggest improvements. Avoid answering questions in comments.">add a comment</a><span class="js-link-separator dno">&nbsp;|&nbsp;</span>
                        <a class="js-show-link comments-link dno" title="expand to show all comments on this post" href="#" onclick=""></a>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
    HTML;

    echo strip_divs($html);

#4


-1  

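strip_tags($str, '<div>');

Note that the second argument of strip_tags lists the tags to keep, so this call preserves <div> and strips everything else. To drop the divs you would pass the tags you want to keep instead.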

http://php.net/manual/en/function.strip-tags.php
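A quick sketch of how strip_tags treats its second argument, which is the list of allowed tags. The sample string and variable names are made up for illustration.

```php
<?php
// Made-up sample input.
$sample = '<div class="box"><p>Hello <b>world</b></p></div>';

// The second argument lists tags to KEEP: here only <div> survives.
$keepsDivs = strip_tags($sample, '<div>');

// To drop the divs, allow the tags that should survive instead.
$dropsDivs = strip_tags($sample, '<p><b>');

echo $keepsDivs, "\n", $dropsDivs, "\n";
```

The drawback is that every tag you want to keep has to be listed explicitly. Anything missing from the list is stripped along with the divs.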
