Extracting URLs from text documents using Java + regular expressions

Time: 2022-11-12 10:49:45

I'm trying to create a regular expression to extract URLs from text documents using Java, but thus far I've been unsuccessful. The two cases I'm looking to capture are listed below:

URLs that start with http://

URLs that start with www. (missing the protocol at the front)

along with the query string parameters.

Thanks! I wish I really knew Regular expressions better.

Cheers,

4 answers

#1


26  

If you want to make sure you are really matching a URL and not just some word starting with 'www.', you can use the expression mentioned by DVK earlier. I modified it slightly and wrote a small code snippet as a starting point for you:

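import java.util.*;
import java.util.regex.*;

class FindUrls
{
    public static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<>();

        // CASE_INSENSITIVE so that "HTTP://", upper-case TLDs and escapes such as %2F also match.
        Pattern pattern = Pattern.compile(
            "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" +            // scheme, relative prefix, or literal "www."
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" +             // optional user:password@, then host labels
            "|mil|biz|info|mobi|name|aero|jobs|museum" +                // known TLDs ...
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" +                        // ... or a two-letter TLD, optional port
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +    // path segments
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +                  // first query parameter name
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +                       // and its value
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +                   // further &name=value pairs
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b",                  // optional fragment
            Pattern.CASE_INSENSITIVE);

        // Collect every non-overlapping match in the input.
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }

        return result;
    }
}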

#2


5  

All RegEx-based code is over-engineered, especially the code from the most-voted answer, and here is why: it will find only valid URLs! For example, it will ignore anything that starts with "http://" and has non-ASCII characters inside.

What's more, I have seen processing times of 1-2 seconds (single-threaded, dedicated machine) with the Java RegEx package on very small and simple sentences, nothing special about them; possibly a bug in Java 6 RegEx...

The simplest and fastest solution would be to use StringTokenizer to split the text into tokens, remove the tokens starting with "http://" etc., and concatenate the remaining tokens into text again.
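A rough sketch of that tokenizing idea, adapted to collect the URL-like tokens instead of removing them, since extraction is what the question asks for. The class and method names (TokenUrls, extractUrlTokens) and the exact list of prefixes are my own assumptions, not something from the answer above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

class TokenUrls {
    // Hypothetical helper: keeps every whitespace-separated token that
    // starts with one of the assumed prefixes. No validation is done.
    static List<String> extractUrlTokens(String text) {
        List<String> urls = new ArrayList<>();
        // The one-argument constructor splits on space, tab, newline, CR and form feed.
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            String lower = token.toLowerCase(Locale.ROOT);
            if (lower.startsWith("http://")
                    || lower.startsWith("https://")
                    || lower.startsWith("www.")) {
                urls.add(token);
            }
        }
        return urls;
    }
}
```

Because the token is taken as-is, query string parameters stay attached, but so does any trailing punctuation such as a closing parenthesis or a full stop, and a URL that is not preceded by whitespace is missed.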

If you really want to use RegEx with Java, try Automaton.

#3


3  

This link has very good URL RegExes (they are surprisingly hard to get right, by the way: think http/https, port numbers, valid characters, GET strings, pound signs for anchor links, etc.):

http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/

Perl has CPAN libraries that contain canned RegExes, including ones for URLs. Not sure about Java though :(
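On the Java side, one possible complement to a hand-written pattern is to let the standard java.net.URI class parse each candidate string, so the port, query string and fragment rules do not all have to live in the regex. This is only a sketch: the class and method names (UrlParts, looksLikeUrl) and the choice of accepted schemes are assumptions of mine.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UrlParts {
    // Hypothetical check: true if the candidate parses as an absolute
    // http, https or ftp URI that has a host component.
    static boolean looksLikeUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            if (scheme == null || uri.getHost() == null) {
                return false; // e.g. "www.example.com" parses as a bare path
            }
            return scheme.equalsIgnoreCase("http")
                    || scheme.equalsIgnoreCase("https")
                    || scheme.equalsIgnoreCase("ftp");
        } catch (URISyntaxException e) {
            return false; // illegal characters such as spaces end up here
        }
    }
}
```

Once a string parses, getPort(), getQuery() and getFragment() on the URI object give the individual pieces. Note that a candidate starting with www. has no scheme, so it would need "http://" prepended before a check like this accepts it.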

#4


1  

This tests whether a given line is a URL:

Pattern p = Pattern.compile("http://.*|www\\..*");
Matcher m = p.matcher("http://..."); // put the line you want to check here
if (m.matches()) { // matches() requires the whole line to match the pattern
    // do something
}
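Since matches() only says whether an entire line is a URL, pulling URLs out of running text needs find() instead. Below is a minimal sketch of that variant. The pattern is deliberately loose and is my own assumption (a scheme or "www." prefix followed by non-whitespace), as are the names SimpleUrlFinder and findUrls:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleUrlFinder {
    // Assumed pattern: word boundary, then "http://", "https://" or "www.",
    // then everything up to the next whitespace character.
    private static final Pattern URL =
            Pattern.compile("\\b(?:https?://|www\\.)\\S+", Pattern.CASE_INSENSITIVE);

    // Returns every match found in the text, in order of appearance.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

The \S+ tail keeps query string parameters attached to the match. It also keeps trailing punctuation, so a URL at the end of a sentence comes back with its full stop and may need trimming afterwards.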
