How do I get the domain name from a URL?

Date: 2022-08-23 10:37:36

How can I fetch a domain name from a URL String?

Examples:

+----------------------+------------+
| input                | output     |
+----------------------+------------+
| www.google.com       | google     |
| www.mail.yahoo.com   | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk        | abc        |
+----------------------+------------+

17 answers

#1


38  

I once had to write such a regex for a company I worked for. The solution was this:

  • Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but it lacks ac.uk, for example, so it is not really usable for this.
  • Join the list like the example below. A warning: ordering is important! If org.uk appeared after uk, then example.org.uk would match org instead of example.

Example regex:

^(?:.*?\.)?([^.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$

This worked really well and also matched weird, unofficial top-levels like de.com and friends.

The upside:

  • Very fast if the regex is optimally ordered.

The downside of this solution is of course:

  • A handwritten regex has to be updated manually whenever ccTLDs change or get added. Tedious job!
  • The regex is very large, so it is not very readable.

#2


12  

/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/

#3


4  

I don't know of any libraries, but the string manipulation of domain names is easy enough.

The hard part is knowing whether the name is at the second or third level. For this you will need a data file that you maintain (e.g. for .uk it is not always the third level; some organisations, such as bl.uk and jet.uk, exist at the second level).

The Firefox source from Mozilla has such a data file; check the Mozilla licensing to see if you can reuse it.

#4


3  

from urllib.parse import urlparse

GENERIC_TLDS = [
    'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs',
    'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
    ]

def get_domain(url):
    hostname = urlparse(url.lower()).netloc
    if hostname == '':
        # Force the recognition as a full URL
        hostname = urlparse('http://' + url.lower()).netloc

    # Remove the 'user:passw' and ':port' parts
    hostname = hostname.split('@')[-1].split(':')[0]
    # Remove a leading 'www.' (lstrip('www.') would also eat other leading 'w' or '.' characters)
    if hostname.startswith('www.'):
        hostname = hostname[len('www.'):]
    hostname = hostname.split('.')

    num_parts = len(hostname)
    if (num_parts < 3) or (len(hostname[-1]) > 2):
        return '.'.join(hostname[:-1])
    if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
        return '.'.join(hostname[:-1])
    return '.'.join(hostname[:-2])

This code isn't guaranteed to work with all URLs, and it doesn't filter out those that are syntactically correct but invalid, like 'example.uk'.

However, it'll do the job in most cases.

#5


3  

There are two ways.

Using split

Just parse the URL string:

var domain;
// find & remove protocol (http, ftp, etc.) and get domain
if (url.indexOf('://') > -1 || url.indexOf('//') === 0) {
    domain = url.split('/')[2];
} else {
    domain = url.split('/')[0];
}

// find & remove port number
domain = domain.split(':')[0];

Using Regex

 var r = /:\/\/(.[^/]+)/;
 "http://*.com/questions/5343288/get-url".match(r)[1] 
 => *.com
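For comparison, the same host extraction can be left to a URL parser instead of manual splitting. This is a minimal sketch in Python using the standard library's `urllib.parse.urlsplit`; `host_of` is a hypothetical helper name, and prepending `//` to scheme-less input is an assumption about how such strings should be treated.

```python
import re
from urllib.parse import urlsplit

# Matches input that already starts with "scheme://" or "//".
_HAS_AUTHORITY = re.compile(r"^(?:[A-Za-z][A-Za-z0-9+.-]*:)?//")

def host_of(url):
    # urlsplit only recognises a host after "//", so add it when missing.
    if not _HAS_AUTHORITY.match(url):
        url = "//" + url
    # .hostname drops any user info and port, and lowercases the host.
    return urlsplit(url).hostname

print(host_of("http://example.com:8080/path"))  # example.com
print(host_of("www.example.com/x?y=1"))         # www.example.com
```

Like the snippets above, this returns the full host; reducing it to the registered domain still needs one of the suffix-based approaches from the other answers.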

Hope this helps.

#6


3  

Extracting the domain name accurately can be quite tricky, mainly because the domain extension can contain two parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there. Listing all domain extensions is not an option because there are hundreds of them. EuroDNS.com, for example, lists over 800 domain name extensions.

I therefore wrote a short PHP function that uses parse_url() and some observations about domain extensions to extract the URL components and the domain name. The function is as follows:

function parse_url_all($url){
    $url = substr($url,0,4)=='http'? $url: 'http://'.$url;
    $d = parse_url($url);
    $tmp = explode('.',$d['host']);
    $n = count($tmp);
    if ($n>=2){
        if ($n==4 || ($n==3 && strlen($tmp[($n-2)])<=3)){
            $d['domain'] = $tmp[($n-3)].".".$tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-3)];
        } else {
            $d['domain'] = $tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-2)];
        }
    }
    return $d;
}

This simple function will work in almost every case. There are a few exceptions, but these are very rare.
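To see where the exceptions come from, the length heuristic can be sketched on its own. This is a rough Python port under assumptions: `split_host` is a hypothetical name, and it takes a bare host rather than a full URL.

```python
def split_host(host):
    """Rough port of the length heuristic used by parse_url_all."""
    parts = host.lower().split(".")
    n = len(parts)
    if n < 2:
        return None
    # Four labels, or three labels with a short middle label, are treated
    # as name + two-part extension (for example test.co.uk).
    if n == 4 or (n == 3 and len(parts[-2]) <= 3):
        return ".".join(parts[-3:]), parts[-3]
    return ".".join(parts[-2:]), parts[-2]

print(split_host("www.test.com"))    # ('test.com', 'test')
print(split_host("www.test.co.uk"))  # ('test.co.uk', 'test')
print(split_host("www.bbc.com"))     # ('www.bbc.com', 'www')
```

The last call shows one kind of exception: a registered name of three characters or fewer behind a subdomain is mistaken for part of the extension.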

To demonstrate / test this function you can use the following:

$urls = array('www.test.com', 'test.com', 'cp.test.com' .....);
echo "<div style='overflow-x:auto;'>";
echo "<table>";
echo "<tr><th>URL</th><th>Host</th><th>Domain</th><th>Domain X</th></tr>";
foreach ($urls as $url) {
    $info = parse_url_all($url);
    echo "<tr><td>".$url."</td><td>".$info['host'].
    "</td><td>".$info['domain']."</td><td>".$info['domainX']."</td></tr>";
}
echo "</table></div>";

The output will be as follows for the URLs listed:

[Image: output table]

As you can see, both the domain name and the domain name without the extension are extracted consistently, whatever URL is presented to the function.

I hope that this helps.

#7


2  

Basically, what you want is:

google.com        -> google.com    -> google
www.google.com    -> google.com    -> google
google.co.uk      -> google.co.uk  -> google
www.google.co.uk  -> google.co.uk  -> google
www.google.org    -> google.org    -> google
www.google.org.uk -> google.org.uk -> google

Optional:

www.google.com     -> google.com    -> www.google
images.google.com  -> google.com    -> images.google
mail.yahoo.co.uk   -> yahoo.co.uk   -> mail.yahoo
mail.yahoo.com     -> yahoo.com     -> mail.yahoo
www.mail.yahoo.com -> yahoo.com     -> mail.yahoo

You don't need to construct an ever-changing regex, as 99% of domains will be matched properly if you simply look at the second-to-last part of the name:

(co|com|gov|net|org)

If it is one of these, then you need to keep the last 3 parts, else the last 2. Simple. Now, my regex wizardry is no match for that of some other SO'ers, so the best way I've found to achieve this is with some code, assuming you've already stripped off the path:

 my @d=split /\./,$domain;                   # split the domain part into an array
 $c=@d;                                      # count how many parts
 $dest=$d[$c-2].'.'.$d[$c-1];                # use the last 2 parts
 if ($d[$c-2]=~m/^(co|com|gov|net|org)$/) {  # is the second-last part exactly one of these?
   $dest=$d[$c-3].'.'.$dest;                 # if so, add a third part
 };
 print $dest;                                # show it

To just get the name, as per your question:

 my @d=split /\./,$domain;                   # split the domain part into an array
 $c=@d;                                      # count how many parts
 if ($d[$c-2]=~m/^(co|com|gov|net|org)$/) {  # is the second-last part exactly one of these?
   $dest=$d[$c-3];                           # if so, give the third last
   $dest=$d[$c-4].'.'.$dest if ($c>3);       # optional bit
 } else {
   $dest=$d[$c-2];                           # else the second last
   $dest=$d[$c-3].'.'.$dest if ($c>2);       # optional bit
 };
 print $dest;                                # show it

I like this approach because it's maintenance-free, unless you want to validate that it's actually a legitimate domain. That's kind of pointless, though, because you're most likely only using this to process log files, and an invalid domain wouldn't find its way in there in the first place.

If you'd like to match "unofficial" subdomains such as bozo.za.net, bozo.au.uk or bozo.msf.ru, just add (za|au|msf) to the regex.

I'd love to see someone do all of this using just a regex, I'm sure it's possible.
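One possible single-regex version of the non-optional case is sketched below in Python. It is an illustration rather than a tested production pattern: `NAME_RE` and `domain_name` are hypothetical names, and the label list is the one given above.

```python
import re

# Either "name.<co|com|gov|net|org>.<tld>" or "name.<tld>", anchored at the end.
NAME_RE = re.compile(r"([^.]+)\.(?:(?:co|com|gov|net|org)\.[^.]+|[^.]+)$")

def domain_name(host):
    # search() finds the leftmost label that can be followed by a valid ending.
    match = NAME_RE.search(host.lower())
    return match.group(1) if match else None

print(domain_name("www.google.co.uk"))   # google
print(domain_name("www.google.org.uk"))  # google
print(domain_name("mail.yahoo.com"))     # yahoo
```

It shares the heuristic's blind spots: a registered name that happens to be one of the listed labels would be misread.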

#8


1  

/(?:www\.)?((?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,65}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6})/i

Capture group 1 of this JavaScript regex skips a leading www and the following dot while keeping the rest of the domain intact. It also properly matches hosts without www and with ccTLDs.

#9


1  

It is not possible without using a TLD list to compare against, as there are many cases like http://www.db.de/ or http://bbc.co.uk/

But even with that you won't succeed in every case, because of SLDs like http://big.uk.com/ or http://www.uk.com/

If you need a complete list you can use the public suffix list:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

Feel free to extend my function to extract only the domain name. It doesn't use regex and it is fast:

http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm#3471878
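A list-based lookup can be sketched in a few lines of Python. This is not the linked function, only an illustration under assumptions: `SUFFIX_RULES` is a hand-picked stand-in for the real list, `registrable_domain` is a hypothetical name, and the wildcard and exception rules that the public suffix list also contains are left out.

```python
# Hypothetical, hand-picked subset of suffix rules for illustration only.
SUFFIX_RULES = {"com", "de", "uk", "co.uk", "org.uk", "uk.com"}

def registrable_domain(host, rules=SUFFIX_RULES):
    labels = host.lower().strip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so "co.uk" wins over "uk".
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in rules:
            if i == 0:
                return None  # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix

print(registrable_domain("www.db.de"))   # db.de
print(registrable_domain("bbc.co.uk"))   # bbc.co.uk
print(registrable_domain("big.uk.com"))  # big.uk.com
```

Whether uk.com counts as a suffix depends entirely on the list, which is why big.uk.com and www.uk.com only come out as separate domains when that entry is present.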

#10


0  

You need a list of what domain prefixes and suffixes can be removed. For example:

Prefixes:

  • www.

Suffixes:

  • .com
  • .co.in
  • .au.uk
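With those two lists, the stripping itself needs no regex. A minimal sketch in Python, assuming the lists are exactly the examples above; `PREFIXES`, `SUFFIXES` and `strip_affixes` are hypothetical names.

```python
PREFIXES = ["www."]
SUFFIXES = [".com", ".co.in", ".au.uk"]

def strip_affixes(host):
    # Remove at most one known prefix.
    for prefix in PREFIXES:
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    # Check longer suffixes first so a multi-part suffix is removed whole.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if host.endswith(suffix):
            host = host[:-len(suffix)]
            break
    return host

for host in ["www.google.com", "www.mail.yahoo.com",
             "www.mail.yahoo.co.in", "www.abc.au.uk"]:
    print(host, "->", strip_affixes(host))
```

For the four inputs from the question this yields google, mail.yahoo, mail.yahoo and abc.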

#11


0  

So if you just have a string and not a window.location you could use...

String.prototype.toUrl = function(){

if(this.length === 0)
{
    return undefined;
}
var original = this.toString();
var s = original;
if(!original.toLowerCase().startsWith('http'))
{
    s = 'http://' + original;
}

s = s.split('/');

var protocol = s[0];
var host = s[2];
var relativePath = '';

if(s.length > 3){
    for(var i=3;i< s.length;i++)
    {
        relativePath += '/' + s[i];
    }
}

s = host.split('.');
var domain = s[s.length-2] + '.' + s[s.length-1];    

return {
    original: original,
    protocol: protocol,
    domain: domain,
    host: host,
    relativePath: relativePath,
    getParameter: function(param)
    {
        return this.getParameters()[param];
    },
    getParameters: function(){
        var vars = [], hash;
        var hashes = this.original.slice(this.original.indexOf('?') + 1).split('&');
        for (var i = 0; i < hashes.length; i++) {
            hash = hashes[i].split('=');
            vars.push(hash[0]);
            vars[hash[0]] = hash[1];
        }
        return vars;
    }
};};

How to use.

var str = "http://en.wikipedia.org/wiki/Knopf?q=1&t=2";
var url = str.toUrl();

var host = url.host;
var domain = url.domain;
var original = url.original;
var relativePath = url.relativePath;
var paramQ = url.getParameter('q');
var paramT = url.getParameter('t');

#12


0  

For a certain purpose I wrote this quick Python function yesterday. It returns the domain from a URL. It's quick and doesn't need any input file listing suffixes. I don't claim it works in all cases, but it does the job I needed for a simple text-mining script.

Output looks like this:

http://www.google.co.uk => google.co.uk
http://24.media.tumblr.com/tumblr_m04s34rqh567ij78k_250.gif => tumblr.com

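import re

def getDomain(url):
    # parts[2] is the host, assuming the URL starts with "scheme://"
    host = url.split("/")[2]
    match = re.match(r"([\w\-]+\.)*([\w\-]+\.\w{2,6}$)", host)
    if match is None:
        return ''
    if re.search(r"\.uk$", host):
        # .uk names usually carry one more level (co.uk, org.uk, ...)
        uk_match = re.match(r"([\w\-]+\.)*([\w\-]+\.[\w\-]+\.\w{2,6}$)", host)
        if uk_match is not None:
            match = uk_match
    return match.group(2)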

Seems to work pretty well.
However, it has to be modified to remove domain extensions on output as you wished.

#13


0  

Use this (.)(.*?)(.) then just extract the leading and end points. Easy, right?

#14


0  

  1. How is this:

    =((?:(?:(?:http)s?:)?\/\/)?(?:(?:[a-zA-Z0-9]+)\.?)*(?:(?:[a-zA-Z0-9]+))\.[a-zA-Z0-9]{2,3})

    (You may want to add "\/" to the end of the pattern.)

  2. If your goal is to get rid of URLs passed in as a param, you may add the equal sign as the first char, like:

    =((?:(?:(?:http)s?:)?//)?(?:(?:[a-zA-Z0-9]+)\.?)*(?:(?:[a-zA-Z0-9]+))\.[a-zA-Z0-9]{2,3}/)

    and replace with "/".

The goal of this example is to get rid of any domain name regardless of the form it appears in (i.e. to ensure URL parameters don't include domain names, to avoid XSS attacks).

#15


-1  

#!/usr/bin/perl -w
use strict;

my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+)\.[^\/]+/g) {
  print $3;
}

#16


-1  

/^(?:https?:\/\/)?(?:www\.)?([^\/]+)/i

#17


-1  

Just for knowledge:

'http://api.livreto.co/books'.replace(/^(https?:\/\/)([a-z]{3}[0-9]?\.)?(\w+)(\.[a-zA-Z]{2,3})(\.[a-zA-Z]{2,3})?.*$/, '$3$4$5');

// returns livreto.co
