Replace URLs in text with links to those URLs

Date: 2021-01-05 20:08:44

Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does. Can this be done in a one liner regular expression?

Edit: by body of text I just meant plain text - no HTML

5 answers

#1


9  

You can load the document with a DOM/HTML parsing library (see html5lib), grab all the text nodes, match them against a regular expression, and replace each URI found in a text node with the same URI wrapped in an anchor, using a PCRE such as:

/(https?:[;\/?\\@&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\@\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g

I'm quite sure you can scour around and find some sort of utility that does this, though I can't think of any off the top of my head.

Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?

import re

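# Group 1 is the host (dotted IP, scheme://host, or a www./ftp. host); an optional :port and a path follow.
urlfinder = re.compile(
    r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
    r"|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\.)[-A-Za-z0-9.]+)"
    r"(:[0-9]*)?/[-A-Za-z0-9_$.+!*(),;:@&=?/~#%]*[^]'.}>),\"]"
)

def urlify2(value):
    # \g<0> inserts the whole match, so the link keeps the port and path.
    return urlfinder.sub(r'<a href="\g<0>">\g<0></a>', value)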

Call urlify2 on a string, and I think that's all you need if you aren't dealing with a DOM object.

#2


6  

I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:

import re

_urlfinderregex = re.compile(r'http([^\.\s]+\.[^\.\s]*)+[^\.\s]{2,}')

def linkify(text, maxlinklength):
    def replacewithlink(matchobj):
        url = matchobj.group(0)
        text = url
        # Strip the scheme and a leading "www." from the visible link text.
        if text.startswith('http://'):
            text = text.replace('http://', '', 1)
        elif text.startswith('https://'):
            text = text.replace('https://', '', 1)

        if text.startswith('www.'):
            text = text.replace('www.', '', 1)

        # Shorten long link text by keeping both ends around an ellipsis.
        if len(text) > maxlinklength:
            halflength = maxlinklength // 2  # integer division so the result can be used as a slice index
            text = text[0:halflength] + '...' + text[len(text) - halflength:]

        return '<a class="comurl" href="' + url + '" target="_blank" rel="nofollow">' + text + '<img class="imglink" src="/images/linkout.png"></a>'

    if text:
        return _urlfinderregex.sub(replacewithlink, text)
    else:
        return ''

You'll need to supply a link-out image (the code references /images/linkout.png), but that's pretty easy. This is aimed specifically at user-submitted text such as comments, which I assume is usually what people are dealing with.

#3


1  

/\w+:\/\/[^\s]+/
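
As a rough sketch of how the pattern above could be used as the one-liner the question asks for, the snippet below passes it to Python's re.sub. The function name linkify_simple is made up for illustration, and the anchor markup is an assumption about the desired output.

```python
import re


def linkify_simple(text):
    # Same pattern as above, without the /.../ delimiters that Python does not use.
    # \g<0> stands for the whole matched URL.
    return re.sub(r'\w+://[^\s]+', r'<a href="\g<0>">\g<0></a>', text)


print(linkify_simple("go to http://a.b/c now"))
```

Because the pattern takes everything up to the next whitespace, punctuation that directly follows a URL (a period ending the sentence, for instance) ends up inside the link.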

#4


0  

When you say "body of text" do you mean a plain text file, or body text in an HTML document? If you want the HTML document, you will want to use Beautiful Soup to parse it; then, search through the body text and insert the tags.
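
For the HTML case, the idea of touching only body text can be sketched roughly as follows. This swaps in the standard library's html.parser instead of Beautiful Soup, purely to keep the example free of third-party dependencies. The names URL_RE, TextNodeLinkifier and linkify_html are hypothetical, the URL pattern is deliberately simplistic, and comments, doctypes and script/style content are not handled.

```python
import html
import re
from html.parser import HTMLParser

# Hypothetical, deliberately simple URL pattern; not taken from the answers here.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


class TextNodeLinkifier(HTMLParser):
    """Re-emit an HTML document, linkifying URLs found in text nodes only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []          # pieces of the rebuilt document
        self.anchor_depth = 0  # > 0 while inside an existing <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, unchanged

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # self-closing tags such as <br/>

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # The parser decoded character references, so escape the text again.
        escaped = html.escape(data, quote=False)
        if self.anchor_depth == 0:
            # Only text outside existing links gets wrapped in anchors.
            escaped = URL_RE.sub(r'<a href="\g<0>">\g<0></a>', escaped)
        self.out.append(escaped)


def linkify_html(document):
    parser = TextNodeLinkifier()
    parser.feed(document)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

Tracking the anchor depth is what keeps URLs that are already inside a link from being wrapped a second time.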

Matching the actual URLs is probably best done with the urlparse module (urllib.parse in Python 3). Full discussion here: How do you validate a URL with a regular expression in Python?
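
One way this could look in Python 3, as a sketch only: split the text on whitespace and let urllib.parse.urlparse decide which tokens count as URLs. The helper names looks_like_url and linkify_tokens are invented for this example, and restricting matches to http and https with a non-empty host is an assumption rather than something the answer specifies.

```python
import re
from urllib.parse import urlparse


def looks_like_url(candidate):
    """Return True if candidate parses as an http(s) URL with a host."""
    try:
        parts = urlparse(candidate)
    except ValueError:  # raised for some malformed inputs, e.g. a bad IPv6 literal
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def linkify_tokens(text):
    # Split on whitespace but keep the separators so spacing is preserved.
    pieces = re.split(r"(\s+)", text)
    return "".join(
        f'<a href="{p}">{p}</a>' if looks_like_url(p) else p for p in pieces
    )
```

urlparse is lenient and accepts many strings that are not usable URLs, which is why the sketch checks the scheme and host explicitly instead of relying on parsing alone.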

#5


0  

Gmail is a lot more permissive when it comes to URLs, but it is not always right either. For example, it will turn www.a.b into a hyperlink as well as http://a.b, but it often fails because of wrapped text and uncommon (but valid) URL characters.

See "Appendix A. Collected BNF for URI" for the syntax, and use that to build a reasonable regular expression that also considers what surrounds the URL. You'd be well advised to consider a few of the contexts in which URLs might end up (inside parentheses, at the end of a sentence, and so on).
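
A minimal sketch of what "considering what surrounds the URL" might mean in practice: match a generous candidate first, then peel off trailing punctuation that more likely belongs to the sentence than to the URL. The pattern is a loose approximation and is not derived from the BNF. The names CANDIDATE, TRAILING and linkify_in_context are hypothetical, and prefixing http:// to bare www. hosts is an assumption.

```python
import re

# Loose candidate pattern: a scheme or a bare "www." followed by non-space characters.
CANDIDATE = re.compile(r'(?:https?://|www\.)[^\s<>"]+', re.IGNORECASE)
TRAILING = ".,;:!?'"


def linkify_in_context(text):
    def repl(match):
        url = match.group(0)
        tail = ""
        # Move trailing punctuation, and any unbalanced ")", out of the link.
        while url and (
            url[-1] in TRAILING
            or (url[-1] == ")" and url.count("(") < url.count(")"))
        ):
            tail = url[-1] + tail
            url = url[:-1]
        href = url if "://" in url else "http://" + url
        return f'<a href="{href}">{url}</a>{tail}'

    return CANDIDATE.sub(repl, text)
```

The parenthesis check keeps URLs with balanced parentheses intact while still handling a URL that is itself wrapped in parentheses.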
