使用PHP substr（）和strip_tags（）同时保留格式并且不破坏HTML

I have various HTML strings to cut to 100 characters (of the stripped content, not the original) without stripping tags and without breaking HTML.

我有各种HTML字符串，可以剪切到100个字符（剥离内容，而不是原始字符），无需剥离标记，也不会破坏HTML。

Original HTML string (288 characters):

原始HTML字符串（288个字符）：

$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";

Standard trim: Trim to 100 characters and HTML breaks, stripped content comes to ~40 characters:

标准修剪：修剪到100个字符和HTML中断，剥离内容达到~40个字符：

$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove... */

Stripped HTML: Outputs correct character count but obviously looses formatting:

剥离的HTML：输出正确的字符数，但显然会丢失格式：

$content = substr(strip_tags($content)), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */

Partial solution: using HTML Tidy or purifier to close off tags outputs clean HTML but 100 characters of HTML not displayed content.

部分解决方案：使用HTML Tidy或净化器关闭标签输出干净的HTML但100个字符的HTML未显示内容。

$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove</div></div>... */

Challenge: To output clean HTML and n characters (excluding character count of HTML elements):

挑战：输出干净的HTML和n个字符（不包括HTML元素的字符数）：

$content = cutHTML($content, 100); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>...";

10 个解决方案

#1

Not amazing, but works.

不奇怪，但有效。

function html_cut($text, $max_length)
{
    $tags   = array();
    $result = "";

    $is_open   = false;
    $grab_open = false;
    $is_close  = false;
    $in_double_quotes = false;
    $in_single_quotes = false;
    $tag = "";

    $i = 0;
    $stripped = 0;

    $stripped_text = strip_tags($text);

    while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
    {
        $symbol  = $text{$i};
        $result .= $symbol;

        switch ($symbol)
        {
           case '<':
                $is_open   = true;
                $grab_open = true;
                break;

           case '"':
               if ($in_double_quotes)
                   $in_double_quotes = false;
               else
                   $in_double_quotes = true;

            break;

            case "'":
              if ($in_single_quotes)
                  $in_single_quotes = false;
              else
                  $in_single_quotes = true;

            break;

            case '/':
                if ($is_open && !$in_double_quotes && !$in_single_quotes)
                {
                    $is_close  = true;
                    $is_open   = false;
                    $grab_open = false;
                }

                break;

            case ' ':
                if ($is_open)
                    $grab_open = false;
                else
                    $stripped++;

                break;

            case '>':
                if ($is_open)
                {
                    $is_open   = false;
                    $grab_open = false;
                    array_push($tags, $tag);
                    $tag = "";
                }
                else if ($is_close)
                {
                    $is_close = false;
                    array_pop($tags);
                    $tag = "";
                }

                break;

            default:
                if ($grab_open || $is_close)
                    $tag .= $symbol;

                if (!$is_open && !$is_close)
                    $stripped++;
        }

        $i++;
    }

    while ($tags)
        $result .= "</".array_pop($tags).">";

    return $result;
}

Usage example:

用法示例：

$content = html_cut($content, 100);

#2

I'm not claiming to have invented this, but there is a very complete Text::truncate() method in CakePHP which does what you want:

我并没有声称已经发明了这个，但CakePHP中有一个非常完整的Text :: truncate（）方法可以满足您的需求：

function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
    if (is_array($ending)) {
        extract($ending);
    }
    if ($considerHtml) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen($ending);
        $openTags = array();
        $truncate = '';
        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }

    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        if (isset($spacepos)) {
            if ($considerHtml) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }

    $truncate .= $ending;

    if ($considerHtml) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

#3

Use PHP's DOMDocument class to normalize an HTML fragment:

使用PHP的DOMDocument类来规范化HTML片段：

$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');      
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));

This question is similar to an earlier question and I've copied and pasted one solution here. If the HTML is submitted by users you'll also need to filter out potential Javascript attack vectors like onmouseover="do_something_evil()" or <a href="javascript:more_evil();">...</a>. Tools like HTML Purifier were designed to catch and solve these problems and are far more comprehensive than any code that I could post.

这个问题类似于之前的问题，我在这里复制并粘贴了一个解决方案。如果用户提交HTML，您还需要过滤掉潜在的Javascript攻击媒介，例如onmouseover =“do_something_evil（）”或 ...... 。像HTML Purifier这样的工具旨在捕获并解决这些问题，并且比我发布的任何代码都要全面。

#4

Use a HTML parser and stop after 100 characters of text.

使用HTML解析器并在100个字符的文本后停止。

#5

You should use Tidy HTML. You cut the string and then you run Tidy to close the tags.

你应该使用Tidy HTML。你切断字符串，然后运行Tidy来关闭标签。

(Credits where credits are due)

（学分到期的学分）

#6

I did another function to do it, supports UTF-8:

我做了另外一个功能，支持UTF-8：

/**
 * Limit string without break html tags.
 * Supports UTF8
 * 
 * @param string $value
 * @param int $limit Default 100
 */
function str_limit_html($value, $limit = 100)
{

    if (mb_strwidth($value, 'UTF-8') <= $limit) {
        return $value;
    }

    // Strip text with HTML tags, sum html len tags too.
    // Is there another way to do it?
    do {
        $len          = mb_strwidth($value, 'UTF-8');
        $len_stripped = mb_strwidth(strip_tags($value), 'UTF-8');
        $len_tags     = $len - $len_stripped;

        $value = mb_strimwidth($value, 0, $limit + $len_tags, '', 'UTF-8');
    } while ($len_stripped > $limit);

    // Load as HTML ignoring errors
    $dom = new DOMDocument();
    @$dom->loadHTML('<?xml encoding="utf-8" ?>'.$value, LIBXML_HTML_NODEFDTD);

    // Fix the html errors
    $value = $dom->saveHtml($dom->getElementsByTagName('body')->item(0));

    // Remove body tag
    $value = mb_strimwidth($value, 6, mb_strwidth($value, 'UTF-8') - 13, '', 'UTF-8'); // <body> and </body>
    // Remove empty tags
    return preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $value);
}

SEE DEMO.

看看演示。

I recommend use html_entity_decode in start of function, so preserve the UTF-8 characters:

我建议在函数的开头使用html_entity_decode，所以保留UTF-8字符：

 $value = html_entity_decode($value);

#7

Regardless of the 100 count issues you state at the beginning, you indicate in the challenge the following:

无论您在开始时陈述的100个计数问题，您都在挑战中指出以下内容：

output the character count of strip_tags (the number of characters in the actual displayed text of the HTML)
输出strip_tags的字符数（HTML实际显示文本中的字符数）
retain HTML formatting close
保持HTML格式关闭
any unfinished HTML tag
任何未完成的HTML标记

Here is my proposal: Bascially, I parse through each character counting as I go. I make sure NOT to count any characters in any HTML tag. I also check at the end to make sure I am not in the middle of a word when I stop. Once I stop, I back track to the first available SPACE or > as a stopping point.

这是我的建议：基本上，我会按照每个字符进行解析。我确保不计算任何HTML标记中的任何字符。我也在最后检查，以确保当我停下来时，我不在一个字的中间。一旦我停下来，我就会回溯到第一个可用的SPACE或>作为停止点。

$position = 0;
$length = strlen($content)-1;

// process the content putting each 100 character section into an array
while($position < $length)
{
    $next_position = get_position($content, $position, 100);
    $data[] = substr($content, $position, $next_position);
    $position = $next_position;
}

// show the array
print_r($data);

function get_position($content, $position, $chars = 100)
{
    $count = 0;
    // count to 100 characters skipping over all of the HTML
    while($count <> $chars){
        $char = substr($content, $position, 1); 
        if($char == '<'){
            do{
                $position++;
                $char = substr($content, $position, 1);
            } while($char !== '>');
            $position++;
            $char = substr($content, $position, 1);
        }
        $count++;
        $position++;
    }
echo $count."\n";
    // find out where there is a logical break before 100 characters
    $data = substr($content, 0, $position);

    $space = strrpos($data, " ");
    $tag = strrpos($data, ">");

    // return the position of the logical break
    if($space > $tag)
    {
        return $space;
    } else {
        return $tag;
    }  
}

This will also count the return codes etc. Considering they will take space, I have not removed them.

这也将计算返回代码等。考虑到它们将占用空间，我没有删除它们。

#8

Here is a function I'm using in one of my projects. It's based on DOMDocument, works with HTML5 and is about 2x faster than other solutions I've tried (at least on my machine, 0.22 ms vs 0.43 ms using html_cut($text, $max_length) from the top answer on a 500 text-node-characters string with a limit of 400).

这是我在我的一个项目中使用的函数。它基于DOMDocument，与HTML5一起使用，比我尝试过的其他解决方案快2倍（至少在我的机器上，使用html_cut（$ text，$ max_length）的0.22 ms vs 0.43 ms来自500文本的最高答案 - 节点字符串，限制为400）。

function cut_html ($html, $limit) {
    $dom = new DOMDocument();
    $dom->loadHTML(mb_convert_encoding("<div>{$html}</div>", "HTML-ENTITIES", "UTF-8"), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    cut_html_recursive($dom->documentElement, $limit);
    return substr($dom->saveHTML($dom->documentElement), 5, -6);
}

function cut_html_recursive ($element, $limit) {
    if($limit > 0) {
        if($element->nodeType == 3) {
            $limit -= strlen($element->nodeValue);
            if($limit < 0) {
                $element->nodeValue = substr($element->nodeValue, 0, strlen($element->nodeValue) + $limit);
            }
        }
        else {
            for($i = 0; $i < $element->childNodes->length; $i++) {
                if($limit > 0) {
                    $limit = cut_html_recursive($element->childNodes->item($i), $limit);
                }
                else {
                    $element->removeChild($element->childNodes->item($i));
                    $i--;
                }
            }
        }
    }
    return $limit;
}

#9

Here is my try at the cutter. Maybe you guys can catch some bugs. The problem, i found with the other parsers, is that they don't close tags properly and they cut in the middle of a word (blah)

这是我对刀具的尝试。也许你们可以捕捉到一些错误。我发现与其他解析器一样，问题是它们没有正确关闭标签而且它们切入一个单词的中间（等等）

function cutHTML($string, $length, $patternsReplace = false) {
    $i = 0;
    $count = 0;
    $isParagraphCut = false;
    $htmlOpen = false;
    $openTag = false;
    $tagsStack = array();

    while ($i < strlen($string)) {
        $char = substr($string, $i, 1);
        if ($count >= $length) {
            $isParagraphCut = true;
            break;
        }

        if ($htmlOpen) {
            if ($char === ">") {
                $htmlOpen = false;
            }
        } else {
            if ($char === "<") {
                $j = $i;
                $char = substr($string, $j, 1);

                while ($j < strlen($string)) {
                    if($char === '/'){
                        $i++;
                        break;
                    }
                    elseif ($char === ' ') {
                        $tagsStack[] = substr($string, $i, $j);
                    }
                    $j++;
                }
                $htmlOpen = true;
            }
        }

        if (!$htmlOpen && $char != ">") {
            $count++;
        }

        $i++;
    }

    if ($isParagraphCut) {
        $j = $i;
        while ($j > 0) {
            $char = substr($string, $j, 1);
            if ($char === " " || $char === ";" || $char === "." || $char === "," || $char === "<" || $char === "(" || $char === "[") {
                break;
            } else if ($char === ">") {
                $j++;
                break;
            }
            $j--;
        }
        $string = substr($string, 0, $j);
        foreach($tagsStack as $tag){
            $tag = strtolower($tag);
            if($tag !== "img" && $tag !== "br"){
                $string .= "</$tag>";
            }
        }
        $string .= "...";
    }

    if ($patternsReplace) {
        foreach ($patternsReplace as $value) {
            if (isset($value['pattern']) && isset($value["replace"])) {
                $string = preg_replace($value["pattern"], $value["replace"], $string);
            }
        }
    }
    return $string;
}

#10

try this function

试试这个功能

// trim the string function
function trim_word($text, $length, $startPoint=0, $allowedTags=""){
    $text = html_entity_decode(htmlspecialchars_decode($text));
    $text = strip_tags($text, $allowedTags);
    return $text = substr($text, $startPoint, $length);
}

and

和

echo trim_word("<h2 class='zzzz'>abcasdsdasasdas</h2>","6");

#1