Remove HTML tags from a String

Time: 2022-08-27 17:13:00

Is there a good way to remove HTML from a Java string? A simple regex like

 replaceAll("\\<.*?>","") 

will work, but things like &amp; won't be converted correctly, and non-HTML text between two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
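
To make the two pitfalls concrete, here is a minimal sketch that uses only the JDK. The class and method names (RegexStripDemo, stripTags) are made up for illustration and are not part of the question:

```java
final class RegexStripDemo {
    // The naive approach from the question: drop everything between '<' and '>'.
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // Entities are left encoded: prints "Tom &amp; Jerry"
        System.out.println(stripTags("<b>Tom &amp; Jerry</b>"));
        // Non-HTML text between angle brackets is lost: prints "if a  c"
        System.out.println(stripTags("if a < b and b > c"));
    }
}
```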

26 answers

#1


467  

Use an HTML parser instead of a regex. This is dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.

#2


250  

If you're writing for Android you can do this...

android.text.Html.fromHtml(instruction).toString()

#3


69  

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:

replaceAll("\\<[^>]*>","")

but you will run into issues if the user enters something malformed, like <bhey!</b>.

You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to encode any remaining HTML special characters to keep your output safe.
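
As a rough illustration of that final encoding step, the sketch below escapes the characters that are special in HTML text and attribute values. It uses only the JDK; HtmlEscaper and escape are hypothetical names, and a maintained library is usually the safer choice in real code:

```java
final class HtmlEscaper {
    // Escapes the five characters that are significant in HTML text and attributes.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: 1 &lt; 2 &amp; &quot;quoted&quot;
        System.out.println(escape("1 < 2 & \"quoted\""));
    }
}
```

Running the escape after any tag stripping means that whatever markup slipped through is shown as text rather than interpreted by the browser.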

#4


26  

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            FileReader in = new FileReader("java-new.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

ref : Remove HTML tags from a file to extract only the TEXT

#5


17  

Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).

    Source htmlSource = new Source(htmlText);
    Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
    Renderer htmlRend = new Renderer(htmlSeg);
    System.out.println(htmlRend.toString());

#6


16  

I think the simplest way to filter out the HTML tags is:

private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}

#7


13  

On Android, try this:

String result = Html.fromHtml(html).toString();

#8


11  

HTML escaping is really hard to do right. I'd definitely suggest using library code to do this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.

#9


10  

The accepted answer of simply calling Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):

  • It removes line breaks from the text
  • It converts the text &lt;script&gt; into <script>

If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:

// breaks multi-level of escaping, preventing &amp;lt;script&amp;gt; to be rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; to be rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);

Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.

And here is a bunch of test cases (input to output):

{"regular string", "regular string"},
{"<a href=\"link\">A link</a>", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}

If you find a way to make it better, please let me know.

#10


5  

You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.

The only way I can think of to remove HTML tags while leaving non-HTML text between angle brackets is to check against a list of HTML tags. Something along these lines...

replaceAll("\\<[\\s]*tag[^>]*>","")

Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
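
A minimal sketch of this idea, using only the JDK, might look like the following. The tag list, the class name KnownTagStripper and the handful of decoded entities are assumptions made for illustration. Unlike the pattern above, this version does not allow whitespace right after '<', so text such as x < y is left alone. As already noted, the output should not be treated as sanitized:

```java
import java.util.regex.Pattern;

final class KnownTagStripper {
    // Hypothetical list of tag names to strip; extend it as needed.
    private static final String TAGS = "p|br|b|i|u|a|div|span|script";

    // <br>, <br/>, <br /> and </p> become newlines before anything is stripped.
    private static final Pattern LINE_BREAKS =
            Pattern.compile("<(?:br\\s*/?|/p)\\s*>", Pattern.CASE_INSENSITIVE);

    // Opening or closing tags whose name is in the list, with any attributes.
    private static final Pattern KNOWN_TAGS =
            Pattern.compile("</?(?:" + TAGS + ")\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        String text = LINE_BREAKS.matcher(html).replaceAll("\n");
        text = KNOWN_TAGS.matcher(text).replaceAll("");
        // Decode a few common entities; &amp; goes last to avoid double decoding.
        return text.replace("&lt;", "<").replace("&gt;", ">")
                   .replace("&quot;", "\"").replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>x < y &amp; y > z</p><b>bold</b><br/>end"));
    }
}
```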

#11


3  

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".

So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):

import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 *
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 *
 * @author dan
 * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> 
 */
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;

public Html2Text2() {
}

public void parse(String str) throws IOException, SAXException {
    XMLReader reader = new Parser();
    reader.setContentHandler(this);
    sb = new StringBuffer();
    reader.parse(new InputSource(new StringReader(str)));
}

public String getText() {
    return sb.toString();
}

@Override
public void characters(char[] ch, int start, int length)
    throws SAXException {
    sb.append(ch, start, length);
}

@Override
public void ignorableWhitespace(char[] ch, int start, int length)
    throws SAXException {
    sb.append(ch, start, length);
}

// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}

@Override
public void endElement(String uri, String localName, String qName)
    throws SAXException {
}

@Override
public void endPrefixMapping(String prefix) throws SAXException {
}


@Override
public void processingInstruction(String target, String data)
    throws SAXException {
}

@Override
public void setDocumentLocator(Locator locator) {
}

@Override
public void skippedEntity(String name) throws SAXException {
}

@Override
public void startDocument() throws SAXException {
}

@Override
public void startElement(String uri, String localName, String qName,
    Attributes atts) throws SAXException {
}

@Override
public void startPrefixMapping(String prefix, String uri)
    throws SAXException {
}
}

#12


3  

Here's a slightly more fleshed-out update that tries to handle some formatting for breaks and lists. I used Amaya's output as a guide.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML2Text extends HTMLEditorKit.ParserCallback {
    private static final Logger log = Logger
            .getLogger(Logger.GLOBAL_LOGGER_NAME);

    private StringBuffer stringBuffer;

    private Stack<IndexType> indentStack;

    public static class IndexType {
        public String type;
        public int counter; // used for ordered lists

        public IndexType(String type) {
            this.type = type;
            counter = 0;
        }
    }

    public HTML2Text() {
        stringBuffer = new StringBuffer();
        indentStack = new Stack<IndexType>();
    }

    public static String convert(String html) {
        HTML2Text parser = new HTML2Text();
        Reader in = new StringReader(html);
        try {
            // the HTML to convert
            parser.parse(in);
        } catch (Exception e) {
            log.severe(e.getMessage());
        } finally {
            try {
                in.close();
            } catch (IOException ioe) {
                // this should never happen
            }
        }
        return parser.getText();
    }

    public void parse(Reader in) throws IOException {
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("StartTag:" + t.toString());
        if (t.toString().equals("p")) {
            if (stringBuffer.length() > 0
                    && !stringBuffer.substring(stringBuffer.length() - 1)
                            .equals("\n")) {
                newLine();
            }
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.push(new IndexType("ol"));
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.push(new IndexType("ul"));
            newLine();
        } else if (t.toString().equals("li")) {
            IndexType parent = indentStack.peek();
            if (parent.type.equals("ol")) {
                String numberString = "" + (++parent.counter) + ".";
                stringBuffer.append(numberString);
                for (int i = 0; i < (4 - numberString.length()); i++) {
                    stringBuffer.append(" ");
                }
            } else {
                stringBuffer.append("*   ");
            }
            indentStack.push(new IndexType("li"));
        } else if (t.toString().equals("dl")) {
            newLine();
        } else if (t.toString().equals("dt")) {
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.push(new IndexType("dd"));
            newLine();
        }
    }

    private void newLine() {
        stringBuffer.append("\n");
        for (int i = 0; i < indentStack.size(); i++) {
            stringBuffer.append("    ");
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        log.info("EndTag:" + t.toString());
        if (t.toString().equals("p")) {
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("li")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.pop();
        }
    }

    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("SimpleTag:" + t.toString());
        if (t.toString().equals("br")) {
            newLine();
        }
    }

    public void handleText(char[] text, int pos) {
        log.info("Text:" + new String(text));
        stringBuffer.append(text);
    }

    public String getText() {
        return stringBuffer.toString();
    }

    public static void main(String args[]) {
        String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol>  <li>This</li>  <li>is</li>  <li>an</li>  <li>ordered</li>  <li>list    <p>with</p>    <ul>      <li>another</li>      <li>list        <dl>          <dt>This</dt>          <dt>is</dt>            <dd>sdasd</dd>            <dd>sdasda</dd>            <dd>asda              <p>aasdas</p>            </dd>            <dd>sdada</dd>          <dt>fsdfsdfsd</dt>        </dl>        <dl>          <dt>vbcvcvbcvb</dt>          <dt>cvbcvbc</dt>            <dd>vbcbcvbcvb</dd>          <dt>cvbcv</dt>          <dt></dt>        </dl>        <dl>          <dt></dt>        </dl></li>      <li>cool</li>    </ul>    <p>stuff</p>  </li>  <li>cool</li></ol><p></p></body></html>";
        System.out.println(convert(html));
    }
}

#13


3  

Use Html.fromHtml

The HTML tags are:

<a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>,
<div align="…">, <em>, <font size="…" color="…" face="…">,
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>,
<i>, <p>, <small>,
<strike>, <strong>, <sub>, <sup>, <tt>, <u>

As per Android's official documentation, any tags in the HTML will display as a generic replacement String, which your program can then go through and replace with real strings.

The Html.fromHtml method takes an Html.TagHandler and an Html.ImageGetter as arguments, as well as the text to parse.

Example

String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";

Then

Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());

Output

This is about me text that the user can put into their profile

#14


2  

One more way is to use the com.google.gdata.util.common.html.HtmlToText class, like this:

MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));

This is not bulletproof code, though: when I run it on Wikipedia entries I get style info as well. However, I believe it would be effective for small/simple jobs.

#15


2  

I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:

noHTMLString.replaceAll("\\&.*?\\;", "");

instead of this:

html = html.replaceAll("&nbsp;","");
html = html.replaceAll("&amp;", "");
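
One caveat with a pattern as loose as &.*?; is that it also removes ordinary text that merely sits between an ampersand and a semicolon. A hedged sketch of a stricter variant, which only removes things shaped like named, decimal or hexadecimal character references, is shown below. EntityStripper and removeEntities are illustrative names, not something from the answer:

```java
import java.util.regex.Pattern;

final class EntityStripper {
    // Matches &name;, &#123; and &#x1F; but not a bare '&' followed by spaces.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);");

    static String removeEntities(String s) {
        return ENTITY.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeEntities("a&nbsp;b&#160;c&#xA0;d")); // prints "abcd"
        System.out.println(removeEntities("x & y; z"));               // prints "x & y; z"
    }
}
```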

#16


2  

It sounds like you want to go from HTML to plain text.
If that is the case look at www.htmlparser.org. Here is an example that strips all the tags out from the html file found at a URL.
It makes use of org.htmlparser.beans.StringBean.

static public String getUrlContentsAsText(String url) {
    String content = "";
    StringBean stringBean = new StringBean();
    stringBean.setURL(url);
    content = stringBean.getStrings();
    return content;
}

#17


2  

Here is another way to do it:

public static String removeHTML(String input) {
    StringBuilder s = new StringBuilder();
    boolean inTag = false;

    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (c == '<') {
            inTag = true;           // entering a tag
        } else if (c == '>' && inTag) {
            inTag = false;          // leaving a tag
        } else if (!inTag) {
            s.append(c);            // keep text outside of tags
        }
    }
    return s.toString();
}

#18


2  

Alternatively, one can use HtmlCleaner:

private CharSequence removeHtmlFrom(String html) {
    return new HtmlCleaner().clean(html).getText();
}

#19


2  

This should work. Use this:

  text.replaceAll("<.*?>", " ")

This will replace all the HTML tags with a space. And this:

  text.replaceAll("&.*?;", "")

This will remove all the entities that start with "&" and end with ";", such as &nbsp;, &amp;, &gt; etc.

#20


1  

One could also use Apache Tika for this purpose. By default it preserves whitespace from the stripped HTML, which may be desired in certain situations:

InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
System.out.println(htmlContentHandler.getBodyText().trim());

#21


0  

My 5 cents:

// Decode every "&amp;" into a literal "&".
yourString = yourString.replace("&amp;", "&");

#22


0  

To get formatted plain HTML text, you can do this:

String BR_ESCAPED = "&lt;br/&gt;";
Elements el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");

To get formatted plain text, replace <br/> with \n and change the last line to:

nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");

#23


0  

Remove HTML tags from a string. Sometimes we need to parse a string received in a response from the server, such as an HttpResponse.

Here I will show how to remove HTML tags from a string (the snippet is C#, not Java).

    // sample text with tags
    string str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";

    // regex which matches tags
    System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex("<[^>]*>");

    // replace all matches with an empty string
    str = rx.Replace(str, "");

    // now str contains the string without HTML tags
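
Since the snippet above relies on .NET's System.Text.RegularExpressions, a rough Java equivalent using java.util.regex might look like this. TagRegex and removeTags are illustrative names, and the same caveats about regex-based stripping apply:

```java
import java.util.regex.Pattern;

final class TagRegex {
    // Same pattern as above: '<', any run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String removeTags(String s) {
        return TAG.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";
        System.out.println(removeTags(str)); // prints "sdfkashf sdfsdfasdf"
    }
}
```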

#24


0  

One way to retain new-line info with JSoup is to precede all new line tags with some dummy string, execute JSoup and replace dummy string with "\n".

String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
    html = html.replace(tag, NEW_LINE_MARK+tag);
}

String text = Jsoup.parse(html).text();

text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");

#25


0  

Example: classeString.replaceAll("<(/?[^>]+)>", " ").replaceAll("\\s+", " ").trim()

#26


-1  

You can simply write a method with multiple replaceAll() calls, like:

String removeTag(String html) {
    html = html.replaceAll("\\<.*?>", "");
    html = html.replaceAll("&nbsp;", "");
    html = html.replaceAll("&amp;", "");
    // ... further replacements ...
    return html;
}

Use this link for the most common replacements you need: http://tunes.org/wiki/html_20special_20characters_20and_20symbols.html

It is simple but effective. I use this method first to remove the junk, but without the very first line, i.e. replaceAll("\\<.*?>",""). Later I use specific keywords to search for indexes and then use the .substring(start, end) method to strip away unnecessary stuff, as this is more robust and lets you pinpoint exactly what you need in the entire HTML page.
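
The keyword-and-substring step could be sketched roughly as follows. The helper name between and the marker strings are assumptions for illustration, and the approach only works when the markers appear in a predictable place on the page:

```java
final class Between {
    // Returns the text between the first startMarker and the next endMarker,
    // or an empty string when either marker is missing.
    static String between(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return "";
        }
        start += startMarker.length();
        int end = text.indexOf(endMarker, start);
        return end < 0 ? "" : text.substring(start, end);
    }

    public static void main(String[] args) {
        String page = "<html><title>Hello</title></html>";
        System.out.println(between(page, "<title>", "</title>")); // prints "Hello"
    }
}
```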

#12


3  

Here's a lightly more fleshed out update to try to handle some formatting for breaks and lists. I used Amaya's output as a guide.

这里有一个稍微充实的更新,以尝试处理中断和列表的一些格式。我用Amaya的输出作为向导。

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML2Text extends HTMLEditorKit.ParserCallback {
    private static final Logger log = Logger
            .getLogger(Logger.GLOBAL_LOGGER_NAME);

    private StringBuffer stringBuffer;

    private Stack<IndexType> indentStack;

    public static class IndexType {
        public String type;
        public int counter; // used for ordered lists

        public IndexType(String type) {
            this.type = type;
            counter = 0;
        }
    }

    public HTML2Text() {
        stringBuffer = new StringBuffer();
        indentStack = new Stack<IndexType>();
    }

    public static String convert(String html) {
        HTML2Text parser = new HTML2Text();
        Reader in = new StringReader(html);
        try {
            // the HTML to convert
            parser.parse(in);
        } catch (Exception e) {
            log.severe(e.getMessage());
        } finally {
            try {
                in.close();
            } catch (IOException ioe) {
                // this should never happen
            }
        }
        return parser.getText();
    }

    public void parse(Reader in) throws IOException {
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("StartTag:" + t.toString());
        if (t.toString().equals("p")) {
            if (stringBuffer.length() > 0
                    && !stringBuffer.substring(stringBuffer.length() - 1)
                            .equals("\n")) {
                newLine();
            }
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.push(new IndexType("ol"));
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.push(new IndexType("ul"));
            newLine();
        } else if (t.toString().equals("li")) {
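            // Guard against a stray <li> outside any list: peek() on an empty
            // stack would throw EmptyStackException.
            IndexType parent = indentStack.isEmpty() ? null : indentStack.peek();
            if (parent != null && parent.type.equals("ol")) {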
                String numberString = "" + (++parent.counter) + ".";
                stringBuffer.append(numberString);
                for (int i = 0; i < (4 - numberString.length()); i++) {
                    stringBuffer.append(" ");
                }
            } else {
                stringBuffer.append("*   ");
            }
            indentStack.push(new IndexType("li"));
        } else if (t.toString().equals("dl")) {
            newLine();
        } else if (t.toString().equals("dt")) {
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.push(new IndexType("dd"));
            newLine();
        }
    }

    private void newLine() {
        stringBuffer.append("\n");
        for (int i = 0; i < indentStack.size(); i++) {
            stringBuffer.append("    ");
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        log.info("EndTag:" + t.toString());
        if (t.toString().equals("p")) {
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("li")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.pop();
        }
    }

    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("SimpleTag:" + t.toString());
        if (t.toString().equals("br")) {
            newLine();
        }
    }

    public void handleText(char[] text, int pos) {
        log.info("Text:" + new String(text));
        stringBuffer.append(text);
    }

    public String getText() {
        return stringBuffer.toString();
    }

    public static void main(String args[]) {
        String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol>  <li>This</li>  <li>is</li>  <li>an</li>  <li>ordered</li>  <li>list    <p>with</p>    <ul>      <li>another</li>      <li>list        <dl>          <dt>This</dt>          <dt>is</dt>            <dd>sdasd</dd>            <dd>sdasda</dd>            <dd>asda              <p>aasdas</p>            </dd>            <dd>sdada</dd>          <dt>fsdfsdfsd</dt>        </dl>        <dl>          <dt>vbcvcvbcvb</dt>          <dt>cvbcvbc</dt>            <dd>vbcbcvbcvb</dd>          <dt>cvbcv</dt>          <dt></dt>        </dl>        <dl>          <dt></dt>        </dl></li>      <li>cool</li>    </ul>    <p>stuff</p>  </li>  <li>cool</li></ol><p></p></body></html>";
        System.out.println(convert(html));
    }
}

#13


3  

Use Html.fromHtml

The HTML tags it handles are:

<a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>,
<div align="…">, <em>, <font size="…" color="…" face="…">,
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>,
<i>, <p>, <small>,
<strike>, <strong>, <sub>, <sup>, <tt>, <u>

As per Android's official documentation, any <img> tags in the HTML will display as a generic replacement image, which your program can then go through and replace with real images.

An overload of Html.fromHtml takes an Html.ImageGetter and an Html.TagHandler as arguments, in addition to the text to parse.

Example

String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";

Then

Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());

Output

This is about me text that the user can put into their profile

#14


2  

Another option is the com.google.gdata.util.common.html.HtmlToText class, for example:

MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));

This is not bulletproof code, though: when I run it on Wikipedia entries, I also get style information in the output. However, I believe it is effective for small, simple jobs.

#15


2  

I know this is old, but I was just working on a project that required me to filter HTML, and this worked fine:

noHTMLString = noHTMLString.replaceAll("\\&.*?\\;", "");

instead of this:

html = html.replaceAll("&nbsp;", "");
html = html.replaceAll("&amp;", "");
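One caveat with deleting everything between "&" and ";" is that the characters themselves are lost ("Tom &amp; Jerry" becomes "Tom  Jerry"), and ordinary text such as "fish & chips; yum" is also eaten. If the goal is readable text, decoding may be preferable to deleting. The following is only a minimal sketch of that idea, assuming Java 9 or later; the class name EntityDecoder and its small entity table are hypothetical and cover just a handful of named entities plus numeric references, not the full HTML entity list.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Illustrative subset only; HTML defines many more named entities.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">",
            "quot", "\"", "apos", "'", "nbsp", "\u00A0");

    // Matches &name; plus decimal (&#38;) and hexadecimal (&#x26;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    public static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                // Unknown names are left untouched rather than deleted.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String fromCodePoint(String digits, int radix, String fallback) {
        try {
            int cp = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(cp) ? new String(Character.toChars(cp)) : fallback;
        } catch (NumberFormatException e) {
            return fallback; // e.g. a number too large for an int
        }
    }
}
```

Because the text is decoded in a single pass, "&amp;lt;" comes out as "&lt;" rather than being decoded twice. For real-world input, a library with a complete entity table is probably the safer choice.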

#16


2  

It sounds like you want to go from HTML to plain text. If that is the case, look at www.htmlparser.org. Here is an example that strips all the tags from the HTML file found at a URL. It makes use of org.htmlparser.beans.StringBean.

static public String getUrlContentsAsText(String url) {
    String content = "";
    StringBean stringBean = new StringBean();
    stringBean.setURL(url);
    content = stringBean.getStrings();
    return content;
}

#17


2  

Here is another way to do it:

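public static String removeHTML(String input) {
    StringBuilder out = new StringBuilder(input.length());
    boolean inTag = false;

    // Copy characters one by one, skipping everything between '<' and '>'.
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (c == '<') {
            inTag = true;          // start skipping at the opening bracket
        } else if (c == '>' && inTag) {
            inTag = false;         // resume copying after the closing bracket
        } else if (!inTag) {
            out.append(c);
        }
    }
    return out.toString();
}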

#18


2  

Alternatively, one can use HtmlCleaner:

private CharSequence removeHtmlFrom(String html) {
    return new HtmlCleaner().clean(html).getText();
}

#19


2  

This should work:

  text.replaceAll("<.*?>", " ")

This replaces every HTML tag with a space. Then:

  text.replaceAll("&.*?;", "")

This removes every entity that starts with "&" and ends with ";", such as &nbsp;, &amp;, &gt;, etc.

#20


1  

One could also use Apache Tika for this purpose. By default it preserves whitespace from the stripped HTML, which may be desirable in certain situations:

InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
System.out.println(htmlContentHandler.getBodyText().trim());

#21


0  

My 5 cents:

String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {

    for (int i = 0; i < temp.length; i++) {
        tmp += temp[i] + "&";
    }
    yourString = tmp.substring(0, tmp.length() - 1);
}
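The loop above appears to be aimed at turning "&amp;" back into "&". Note that String.split drops trailing empty strings, so an input ending in "&amp;" (for example "a&amp;") yields a one-element array and is left unchanged. A shorter sketch of what seems to be the same intent, using String.replace for a literal (non-regex) substitution, could look like this; the class and method names are hypothetical.

```java
public class AmpersandFix {
    // Literal replacement: every "&amp;" becomes "&", including a trailing one.
    public static String unescapeAmp(String s) {
        return s.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(unescapeAmp("a &amp; b &amp;"));
    }
}
```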

#22


0  

To get formatted plain HTML text, you can do this:

String BR_ESCAPED = "&lt;br/&gt;";
Elements el=Jsoup.parse(html).select("body"); // select() returns Elements, not Element
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");

To get formatted plain text, replace <br/> with \n and change the last line to:

nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");

#23


0  

This removes HTML tags from a string. Sometimes we receive a string in a response from the server, such as an HTTP response, and need to parse it.

Here I will show how to remove HTML tags from a string. Note that this sample is C# (.NET), not Java.

    // sample text with tags
    string str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";

    // regex which matches tags
    System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex("<[^>]*>");

    // replace all matches with an empty string
    str = rx.Replace(str, "");

    // now str contains the string without HTML tags
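The snippet above uses the .NET Regex class. Since the question is about Java, a rough Java counterpart using java.util.regex might look like the sketch below. It applies the same pattern and therefore shares the same limitations: it does not decode entities and can be confused by a ">" inside an attribute value. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class StripTags {
    // Same pattern as above: '<', then anything except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";
        System.out.println(strip(str)); // prints "sdfkashf sdfsdfasdf"
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll would do.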

#24


0  

One way to retain newline information with Jsoup is to precede all line-breaking tags with a dummy string, run Jsoup, and then replace the dummy string with newlines.

String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
    html = html.replace(tag, NEW_LINE_MARK+tag);
}

String text = Jsoup.parse(html).text();

text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");

#25


0  

For example: classeString.replaceAll("<(/?[^>]+)>", " ").replaceAll("\\s+", " ").trim()

#26


-1  

You can simply write a method with multiple replaceAll() calls, like:

String RemoveTag(String html){
   html = html.replaceAll("\\<.*?>","");
   html = html.replaceAll("&nbsp;","");
   html = html.replaceAll("&amp;","");
   // ... more replacements as needed ...
   return html;
}

Use this link for most common replacements you need: http://tunes.org/wiki/html_20special_20characters_20and_20symbols.html

It is simple but effective. I first use this method without its very first line, i.e. replaceAll("\\<.*?>",""), to remove the junk. Later I use specific keywords to search for indexes and then use the .substring(start, end) method to strip away unnecessary stuff. This is more robust, and you can pinpoint exactly what you need in the entire HTML page.
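The keyword-and-substring step described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name between and the marker strings in the usage example are hypothetical, and the answer does not say which keywords were actually used.

```java
public class Between {
    // Returns the text between the first startKey and the next endKey,
    // or an empty string if either marker is missing.
    public static String between(String text, String startKey, String endKey) {
        int start = text.indexOf(startKey);
        if (start < 0) {
            return "";
        }
        start += startKey.length();
        int end = text.indexOf(endKey, start);
        if (end < 0) {
            return "";
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(between("Price: 42 USD (incl. tax)", "Price: ", " USD")); // prints "42"
    }
}
```

This only works when the page reliably contains the chosen markers, so it is tied to the layout of one specific page.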