Get domain name from given URL

Given a URL, I want to extract the domain name (it should not include the 'www' part). The URL can contain http/https. Here is the Java code that I wrote. Though it seems to work fine, is there a better approach, or are there edge cases that could fail?

public static String getDomainName(String url) throws MalformedURLException{
    if(!url.startsWith("http") && !url.startsWith("https")){
         url = "http://" + url;
    }        
    URL netUrl = new URL(url);
    String host = netUrl.getHost();
    if(host.startsWith("www")){
        host = host.substring("www".length()+1);
    }
    return host;
}

Input: http://google.com/blah

Output: google.com

9 solutions

#1

214

If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems -- its equals method does a DNS lookup which means code using it can be vulnerable to denial of service attacks when used with untrusted inputs.

"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.
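
To make the contrast concrete, here is a minimal sketch of equality on java.net.URI, which is decided from the parsed components alone and never resolves a hostname. The class name UriEqualityDemo and the helper sameUri are illustrative names, not part of any library.

```java
import java.net.URI;

class UriEqualityDemo {
    // Compares two URI strings purely syntactically; no DNS lookup is involved.
    // URI.create throws IllegalArgumentException if either string is not a valid URI.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }
}
```

With this helper, "http://example.com/a" and "HTTP://EXAMPLE.COM/a" compare equal because scheme and host are compared without regard to case, while "http://example.com/a" and "http://example.com/A" do not, because paths are case-sensitive.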

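public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    // getHost() returns null when the input has no authority part, e.g. "example.com/foo"
    String domain = uri.getHost();
    if (domain == null) {
        return null;
    }
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}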

should do what you want.

Though it seems to work fine, is there a better approach, or are there edge cases that could fail?

Your code as written fails for the valid URLs:

httpfoo/bar -- a relative URL with a path component that starts with http.
HTTP://example.com/ -- the protocol is case-insensitive.
//example.com/ -- a protocol-relative URL with a host.
www/foo -- a relative URL with a path component that starts with www.
wwwexample.com -- a domain name that starts with www but not with www. (there is no dot after the www).

Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.
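
As a rough illustration of what the built-in parser does with the five inputs listed above, the sketch below simply asks java.net.URI for the host. HostProbe and hostOf are hypothetical names chosen for this example.

```java
import java.net.URI;

class HostProbe {
    // Returns the host that java.net.URI finds in the reference,
    // or null when the reference has no authority component.
    static String hostOf(String reference) {
        return URI.create(reference).getHost();
    }
}
```

Under this sketch, HTTP://example.com/ and //example.com/ both yield example.com, while httpfoo/bar, www/foo and wwwexample.com yield null because the parser treats them as relative paths rather than hosts.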

If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).
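
A minimal sketch of how that expression might be used from Java follows. Only the regular expression comes from the RFC; the class name Rfc3986Split and the method split are made up for this example. Groups 2, 4, 5, 7 and 9 hold the scheme, authority, path, query and fragment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Rfc3986Split {
    // The expression from RFC 3986 Appendix B, with backslashes escaped for a Java string.
    private static final Pattern URI_REFERENCE = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns {scheme, authority, path, query, fragment}; components that are absent are null.
    static String[] split(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        if (!m.matches()) {
            // Can happen for input containing line breaks, which "." does not match.
            throw new IllegalArgumentException("Not a URI reference: " + reference);
        }
        return new String[] {m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)};
    }
}
```

For http://www.ics.uci.edu/pub/ietf/uri/#Related this gives the scheme http, the authority www.ics.uci.edu, the path /pub/ietf/uri/, no query, and the fragment Related. Note that the authority may still contain user info and a port, so it is not yet a bare host name.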

#2

import java.net.URL;

public class ParseURL {
  public static void main(String[] args) throws Exception {

    URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                       + "/index.html?name=networking#DOWNLOADING");

    System.out.println("protocol = " + aURL.getProtocol());   // http
    System.out.println("authority = " + aURL.getAuthority()); // example.com:80
    System.out.println("host = " + aURL.getHost());           // example.com
    System.out.println("port = " + aURL.getPort());           // 80
    System.out.println("path = " + aURL.getPath());           // /docs/books/tutorial/index.html
    System.out.println("query = " + aURL.getQuery());         // name=networking
    System.out.println("filename = " + aURL.getFile());       // /docs/books/tutorial/index.html?name=networking
    System.out.println("ref = " + aURL.getRef());             // DOWNLOADING
  }
}

#3

Here is a short and simple line using InternetDomainName.topPrivateDomain() in Guava: InternetDomainName.from(new URL(url).getHost()).topPrivateDomain().toString()

Given http://www.google.com/blah, that will give you google.com. Or, given http://www.google.co.mx, it will give you google.co.mx.

As Sa Qada commented in another answer on this post, this question has been asked earlier: Extract main domain name from a given url. The best answer to that question is from Satya, who suggests Guava's InternetDomainName.topPrivateDomain()

public boolean isTopPrivateDomain()

Indicates whether this domain name is composed of exactly one subdomain component followed by a public suffix. For example, returns true for google.com and foo.co.uk, but not for www.google.com or co.uk.

Warning: A true result from this method does not imply that the domain is at the highest level which is addressable as a host, as many public suffixes are also addressable hosts. For example, the domain bar.uk.com has a public suffix of uk.com, so it would return true from this method. But uk.com is itself an addressable host.

This method can be used to determine whether a domain is probably the highest level for which cookies may be set, though even that depends on individual browsers' implementations of cookie controls. See RFC 2109 for details.

Putting that together with URL.getHost(), which the original post already contains, gives you:

import com.google.common.net.InternetDomainName;

import java.net.URL;

public class DomainNameMain {

  public static void main(final String... args) throws Exception {
    final String urlString = "http://www.google.com/blah";
    final URL url = new URL(urlString);
    final String host = url.getHost();
    final InternetDomainName name = InternetDomainName.from(host).topPrivateDomain();
    System.out.println(urlString);
    System.out.println(host);
    System.out.println(name);
  }
}

#4

I wrote a method (see below) which extracts a url's domain name and which uses simple String matching. What it actually does is extract the bit between the first "://" (or index 0 if there's no "://" contained) and the first subsequent "/" (or index String.length() if there's no subsequent "/"). The remaining, preceding "www(_)*." bit is chopped off. I'm sure there'll be cases where this won't be good enough but it should be good enough in most cases!

Mike Samuel's post above says that the java.net.URI class could do this (and was preferred to the java.net.URL class) but I encountered problems with the URI class. Notably, URI.getHost() gives a null value if the url does not include the scheme, i.e. the "http(s)" bit.

/**
 * Extracts the domain name from {@code url}
 * by means of String manipulation
 * rather than using the {@link URI} or {@link URL} class.
 *
 * @param url is non-null.
 * @return the domain name within {@code url}.
 */
public String getUrlDomainName(String url) {
  String domainName = url;

  int index = domainName.indexOf("://");

  if (index != -1) {
    // keep everything after the "://"
    domainName = domainName.substring(index + 3);
  }

  index = domainName.indexOf('/');

  if (index != -1) {
    // keep everything before the '/'
    domainName = domainName.substring(0, index);
  }

  // check for and remove a preceding 'www'
  // followed by any sequence of characters (non-greedy)
  // followed by a '.'
  // from the beginning of the string
  domainName = domainName.replaceFirst("^www.*?\\.", "");

  return domainName;
}
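
The null host for scheme-less input mentioned in this answer can also be worked around while staying with java.net.URI. The sketch below is one possible approach, not something from the answers here: it assumes the caller already knows the input is meant to name a host, and retries the parse with "//" prepended so that the text is read as an authority. SchemelessHost and hostOf are hypothetical names.

```java
import java.net.URI;

class SchemelessHost {
    // Returns the host of the input, retrying as a network-path reference
    // ("//host/path") when the first parse finds no host.
    static String hostOf(String input) {
        String host = URI.create(input).getHost();
        if (host == null && !input.contains("//")) {
            // Assumption: the input is meant to start with a host name, not a relative path.
            host = URI.create("//" + input).getHost();
        }
        return host;
    }
}
```

With this helper, example.com/foo and example.com:8080/foo both give example.com. The trade-off is that a genuinely relative reference such as www/foo is then reported as the host www, so this only makes sense where relative URLs cannot occur.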

#5

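I added a little preprocessing before creating the URI object:

public static String getDomainName(String url) throws URISyntaxException {
    if (url.startsWith("http:/")) {
        // Repair a single-slash scheme such as "http:/example.com"
        if (!url.contains("http://")) {
            url = url.replaceAll("http:/", "http://");
        }
    } else if (!url.startsWith("https://")) {
        // No scheme given: assume plain HTTP (https:// URLs are left untouched)
        url = "http://" + url;
    }
    URI uri = new URI(url);
    String domain = uri.getHost();
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}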

#6

Try this one, which uses java.net.URL:
JOptionPane.showMessageDialog(null, getDomainName(new URL("https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains")));

public String getDomainName(URL url) {
    String strDomain;
    String[] strhost = url.getHost().split(Pattern.quote("."));
    String[] strTLD = {"com", "org", "net", "int", "edu", "gov", "mil", "arpa"};

    if (strhost.length < 2) {
        // Single-label host such as "localhost": nothing to trim
        return url.getHost();
    }
    if (Arrays.asList(strTLD).contains(strhost[strhost.length - 1])) {
        // Generic TLD: keep the last two labels, e.g. "wikipedia.org"
        strDomain = strhost[strhost.length - 2] + "." + strhost[strhost.length - 1];
    } else if (strhost.length > 2) {
        // Any other TLD is assumed to have a two-label suffix, e.g. "example.co.uk"
        strDomain = strhost[strhost.length - 3] + "." + strhost[strhost.length - 2] + "." + strhost[strhost.length - 1];
    } else {
        strDomain = strhost[strhost.length - 2] + "." + strhost[strhost.length - 1];
    }
    return strDomain;
}

#7

There is a similar question: "Extract main domain name from a given url". If you take a look at this answer, you will see that it is very easy. You just need to use java.net.URL and the String utility method split.

#8

If the URL comes from user input, this method returns the most appropriate host name. If no host is found, it returns the input URL.

private String getHostName(String urlInput) {
    urlInput = urlInput.toLowerCase();
    String hostName = urlInput;
    if (urlInput.isEmpty()) {
        return "";
    }
    // "https" also starts with "http", so one check covers both schemes
    if (urlInput.startsWith("http")) {
        try {
            URL netUrl = new URL(urlInput);
            String host = netUrl.getHost();
            // Match "www." including the dot so hosts like "wwwexample.com" are left alone
            hostName = host.startsWith("www.") ? host.substring("www.".length()) : host;
        } catch (MalformedURLException e) {
            hostName = urlInput;
        }
    } else if (urlInput.startsWith("www.")) {
        hostName = urlInput.substring("www.".length());
    }
    return hostName;
}

#9

private static final String hostExtractorRegexString = "(?:https?://)?(?:www\\.)?(.+\\.)(com|au\\.uk|co\\.in|be|in|uk|org\\.in|org|net|edu|gov|mil)";
private static final Pattern hostExtractorRegexPattern = Pattern.compile(hostExtractorRegexString);

public static String getDomainName(String url){
    if (url == null) return null;
    url = url.trim();
    Matcher m = hostExtractorRegexPattern.matcher(url);
    if(m.find() && m.groupCount() == 2) {
        return m.group(1) + m.group(2);
    }
    else {
        return null;
    }
}

Explanation: The regex has four groups. The first two are non-capturing groups and the last two are capturing groups.

The first non-capturing group matches "http://", "https://", or nothing.

The second non-capturing group matches "www." or nothing.

The second capturing group is the top-level domain.

The first capturing group is everything after the non-capturing groups and before the top-level domain.

Concatenating the two capturing groups gives the domain/host name.

PS : Note that you can add any number of supported domains to the regex.
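
To see what the two capturing groups actually hold, the following sketch runs the same pattern and returns both groups separately. GroupDemo and groups are illustrative names; the pattern string is copied from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Same pattern as in the answer: two non-capturing groups, then two capturing groups.
    static final Pattern HOST = Pattern.compile(
        "(?:https?://)?(?:www\\.)?(.+\\.)(com|au\\.uk|co\\.in|be|in|uk|org\\.in|org|net|edu|gov|mil)");

    // Returns {group 1, group 2} for the first match, or null when nothing matches.
    static String[] groups(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }
}
```

For http://www.google.com/blah the groups are "google." and "com". Because group 1 is greedy, it extends to the last dot that is followed by a listed suffix, so http://www.example.com/index.net yields "example.com/index." and "net", which is one edge case to keep in mind with this approach. Also note that groupCount() reports the number of capturing groups in the pattern (2 here) regardless of the input.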
